HONEST CONFIDENCE REGIONS FOR LOGISTIC REGRESSION WITH A LARGE NUMBER OF CONTROLS

ALEXANDRE BELLONI, VICTOR CHERNOZHUKOV AND YING WEI 



Abstract. This paper considers inference in logistic regression models with high-dimensional data. We propose new inference tools for constructing confidence regions for a regression parameter of primary interest $\alpha_0$. These tools allow the total number of controls to exceed the sample size when only a subset of the controls is needed for the regression. Importantly, we show that the resulting confidence regions are "honest" in the formal sense that they hold uniformly over many data-generating processes, and do not rely on traditional consistent model selection arguments. Consequently, the inferential results derived are robust against model selection mistakes.

Key words: uniformly valid inference, instruments, double selection, Neymanization, optimality, sparsity, model selection

1. Introduction

The literature on high-dimensional generalized linear models has been growing rapidly in recent years [19], [H]. A striking feature of this literature is that it achieves consistency while allowing the total number of covariates $p$ to be large relative to the sample size $n$, potentially with $p \gg n$. The main underlying assumption is that the number of relevant controls is bounded by $s \ll n$. Different estimators have been proposed and analyzed. The theoretical guarantees achieved for them are analogous to those of the corresponding estimators for linear models. Results include prediction error consistency, consistency of the parameter estimates in $\ell_k$-norms, variable selection consistency, and minimax-optimal rates. As in the case of linear models, $\ell_1$-penalized estimators achieve desirable theoretical guarantees and, in settings with a convex loss function, they are also computationally tractable even in this high-dimensional setting.

Several papers have focused on high-dimensional logistic models, trying to exploit their structure. $\ell_1$-penalized logistic regression models were studied in [9], [1], and [11]. Group logistic regression was studied in [13] and [11]. Ising models were considered in [18], and connections with robust 1-bit recovery were derived in [17]. These works derive rates of convergence for the coefficients, prediction error consistency, and variable selection consistency under various conditions.

This paper also considers inference in logistic models with high-dimensional data where $p \gg n$ is allowed. We focus on the construction of confidence regions for a parameter of interest $\alpha_0$ which measures the marginal impact of a policy variable. Importantly, we show that these results hold uniformly over many data-generating processes. In particular, the inferential results derived here manage to control model selection mistakes without assuming that all the relevant coefficients are well separated from zero. Such a "separation from zero" assumption is commonly made in the model selection literature but is often unrealistic in applications of interest.

These regions are constructed based on a three-step procedure. The first and second steps build upon the literature of $\ell_1$-penalized methods to estimate a control function and an instrumental variable. The third step suitably combines these estimates. We propose one procedure based on instrumental logistic regression and another based on double selection logistic regression. We verify the validity of these procedures. However, our proofs reveal that many different estimators can be used in each step as long as a required accuracy is achieved. For example, the second step can be based on Lasso, the Dantzig selector, square-root Lasso, the corresponding post-model selection estimators, or others, while the third step can be implemented by a "1-step" correction. Therefore several implementations are possible.

This work relates to other inferential results for high-dimensional models that achieve uniformly valid confidence regions. A common thread among these results is that they do not rely on traditional consistent model selection arguments. For partially linear models, [6] performs double model selection to achieve uniformity properties post-model selection, while [25] uses $\ell_1$-regularization methods. [2] establishes results for instrumental variables. [5] derives similar results focusing on the least absolute deviation setting under homoscedasticity.

The analysis of the proposed confidence regions deals with a non-linear model and post-model selection estimators, and also needs to address the intrinsic heteroscedasticity of the logistic model, which is not present in [5]. Due to heteroscedasticity, in order to achieve efficiency, the estimation of instruments is carried out with Post-Lasso methods with estimated weights, which requires a non-traditional analysis since these weights depend on the dependent variable. Also, in order to use post-model selection estimators, which exhibit better finite sample behavior due to their smaller bias, we develop new results for logistic regression, namely sparsity bounds on the number of non-zero components selected by $\ell_1$-penalized logistic regression that do not rely on the irrepresentability condition as in [1], and rates of convergence for the post-model selection logistic regression. We believe that each of these results is of independent interest.

The paper is organized as follows. Section 2 formally presents the model and the proposed estimator. Section 3 provides primitive conditions and the statements of our main results on the validity of the confidence regions. The proofs of these results are presented in Appendix B and are based on carefully verifying the high-level conditions of a general result in Appendix A. Appendix C collects results on Lasso and Post-Lasso with estimated weights (Appendix C.1) and $\ell_1$-penalized logistic regression and post-model selection logistic regression (Appendix C.2). Auxiliary inequalities are stated in Appendix D.

1.1. Notation. Denote by $(\Omega, \mathrm{P})$ the underlying probability space. The notation $\mathbb{E}_n[\cdot]$ denotes the average over index $1 \le i \le n$, i.e., it simply abbreviates the notation $n^{-1}\sum_{i=1}^n[\cdot]$. For example, $\mathbb{E}_n[x_{ij}^2] = n^{-1}\sum_{i=1}^n x_{ij}^2$. Moreover, we use the notation $\bar{\mathrm{E}}[\cdot] = \mathbb{E}_n[\mathrm{E}[\cdot]]$. For example, $\bar{\mathrm{E}}[v_i^2] = n^{-1}\sum_{i=1}^n \mathrm{E}[v_i^2]$. For a function $f : \mathbb{R} \times \mathbb{R} \times \mathbb{R}^p \to \mathbb{R}$, we write $\mathbb{G}_n(f) = n^{-1/2}\sum_{i=1}^n (f(y_i,d_i,x_i) - \mathrm{E}[f(y_i,d_i,x_i)])$. The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero components of a vector. Denote by $\|\cdot\|_\infty$ the maximal absolute element of a vector. For a sequence $(t_i)_{i=1}^n$, we write $\|t_i\|_{2,n} = \sqrt{\mathbb{E}_n[t_i^2]}$. For example, for a vector $\delta \in \mathbb{R}^p$, $\|x_i'\delta\|_{2,n} = \sqrt{\mathbb{E}_n[(x_i'\delta)^2]}$ denotes the prediction norm of $\delta$. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1,\ldots,p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector such that $(\delta_T)_j = \delta_j$ if $j \in T$ and $(\delta_T)_j = 0$ if $j \notin T$. Also we write the support of $\delta$ as $\mathrm{support}(\delta) = \{j \in \{1,\ldots,p\} : \delta_j \neq 0\}$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$, and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$.

We assume that quantities such as $p$ (the dimension of $x_i$) and $s$ (a bound on the number of non-zero elements of $\beta_0$ and $\theta_0$), and hence $y_i$, $x_i$, $\beta_0$, $\theta_0$, $T$ and $T_\theta$, all depend on the sample size $n$, and we allow for the case where $p = p_n \to \infty$ and $s = s_n \to \infty$ as $n \to \infty$. However, for notational convenience, we shall omit the dependence of these quantities on $n$.



2. Setup and Method 

We consider a generalized linear model in which the binary outcome of interest $y$ relates to a scalar treatment/policy $d$ and a $p$-dimensional control $x$ through a link function $G$:

$$\mathrm{E}[y \mid x, d] = G(d\alpha_0 + x'\beta_0). \tag{2.1}$$

This work studies logistic regression, in which the link function is given by $G(t) = \exp(t)/\{1 + \exp(t)\}$. We aim to perform statistical inference on the coefficient $\alpha_0$, which represents the impact of the treatment on the outcome variable through the link function.

Let $\{(y_i,d_i,x_i) : i = 1,\ldots,n\}$ be a random sample, independent across $i$, obeying the model (2.1). To achieve robustness with respect to model selection mistakes we exploit the following decomposition for the weighted treatment:

$$\sqrt{w_i}\,d_i = \sqrt{w_i}\,x_i'\theta_0 + v_i, \quad \text{where } \mathrm{E}[\sqrt{w_i}\,v_i \mid x_i] = 0. \tag{2.2}$$

The weights $w_i$ in the decomposition above are the conditional variances of the outcome, namely

$$w_i := G_i(1 - G_i), \quad \text{where } G_i := G(d_i\alpha_0 + x_i'\beta_0).$$

The weights are needed to induce the proper orthogonality condition that can "immunize" the estimation against unavoidable model selection mistakes when uniformity is desired. Such a decomposition always exists provided that the second moment of the treatment exists. Notably, the weights are also a function of the treatment unless $\alpha_0 = 0$.

The proposed method builds upon three different procedures. First, it computes an estimate of $x_i'\beta_0$. Second, it estimates an instrument $z_{0i} = z_0(d_i,x_i)$, i.e. a variable correlated with the treatment and orthogonal to the other controls, where orthogonality is with respect to the weighting $w_i$, namely $\mathrm{E}[w_i d_i z_{0i}] \neq 0$ and $\mathrm{E}[w_i z_{0i} \mid x_i] = 0$. Third, it combines these estimates to estimate $\alpha_0$. Several choices of methods and instruments are possible. We study two possibilities. In order to state the proposed methods, denote the (negative) log-likelihood function associated with the logistic link function as

$$\Lambda(\alpha, \beta) = \mathbb{E}_n[\log\{1 + \exp(d_i\alpha + x_i'\beta)\} - y_i(d_i\alpha + x_i'\beta)],$$

and the $\ell_1$-norm of a vector $\eta$ as $\|\eta\|_1 = \sum_{j=1}^{p} |\eta_j|$.
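The quantities defined so far (the link $G$, the weights $w_i$, and the objective $\Lambda$) can be written down directly. The following is a minimal NumPy sketch rather than the authors' implementation; the function names `G`, `neg_loglik` and `weights` are hypothetical labels chosen for illustration, and the `tanh` and `logaddexp` forms are simply numerically stable ways of evaluating the same formulas.

```python
import numpy as np

def G(t):
    """Logistic link G(t) = exp(t) / (1 + exp(t)), evaluated stably via tanh."""
    t = np.asarray(t, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def neg_loglik(alpha, beta, y, d, X):
    """Lambda(alpha, beta) = E_n[log(1 + exp(t_i)) - y_i * t_i], t_i = d_i*alpha + x_i'beta."""
    t = d * alpha + X @ beta
    # logaddexp(0, t) = log(1 + exp(t)) without overflow for large t
    return np.mean(np.logaddexp(0.0, t) - y * t)

def weights(alpha, beta, d, X):
    """w_i = G_i * (1 - G_i): the conditional variance of the binary outcome."""
    g = G(d * alpha + X @ beta)
    return g * (1.0 - g)

# Small illustration on arbitrary data (values are made up for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
d = rng.normal(size=5)
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
value_at_origin = neg_loglik(0.0, np.zeros(3), y, d, X)  # equals log(2)
```

At $\alpha = 0$, $\beta = 0$ every index $t_i$ is zero, so the objective reduces to $\log 2$ and every weight equals $1/4$, which gives a quick sanity check.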

Table 1 displays the first proposed implementation of the strategy outlined in the previous paragraph. The estimation in Step 1 is based on post-model selection logistic regression where the model is selected by $\ell_1$-penalized logistic regression. Step 2 is based on Post-Lasso with estimated weights. Step 3 is based on an instrumental logistic regression. Post-model selection estimators tend to have better finite sample performance than their penalized counterparts because of their smaller bias. Two confidence regions for $\alpha_0$ are constructed in Table 1. $\mathcal{CR}_D$ is based on the asymptotic normality of the estimator $\check\alpha$. The region $\mathcal{CR}_I$ is obtained by inverting a test statistic.

Honest Confidence Regions based on Optimal Instrument

Step 1. Run Post-Lasso-Logistic of $y_i$ on $d_i$ and $x_i$:

$$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha,\beta} \Lambda(\alpha,\beta) + \frac{\lambda_1}{n}\|(\alpha,\beta)\|_1$$

$$(\tilde\alpha, \tilde\beta) \in \arg\min_{\alpha,\beta} \Lambda(\alpha,\beta) : \mathrm{support}(\beta) \subseteq \mathrm{support}(\hat\beta)$$

Keep the value $x_i'\tilde\beta$ and the weight $\hat w_i := G(d_i\tilde\alpha + x_i'\tilde\beta)\{1 - G(d_i\tilde\alpha + x_i'\tilde\beta)\}$, $i = 1,\ldots,n$.

Step 2. Run Post-Lasso-OLS of $\sqrt{\hat w_i}\,d_i$ on $\sqrt{\hat w_i}\,x_i$:

$$\hat\theta \in \arg\min_{\theta} \mathbb{E}_n[\hat w_i(d_i - x_i'\theta)^2] + \frac{\lambda_2}{n}\|\theta\|_1$$

$$\tilde\theta \in \arg\min_{\theta} \mathbb{E}_n[\hat w_i(d_i - x_i'\theta)^2] : \mathrm{support}(\theta) \subseteq \mathrm{support}(\hat\theta)$$

Keep the residual $\hat z_i := d_i - x_i'\tilde\theta$, $i = 1,\ldots,n$.

Step 3. Run Instrumental Logistic Regression of $y_i - x_i'\tilde\beta$ on $d_i$ using $\hat z_i$ as the instrument for $d_i$:

$$\check\alpha \in \arg\inf_{\alpha \in \mathcal{A}} L_n(\alpha), \quad \text{where } L_n(\alpha) = \frac{|\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\tilde\beta)\}\hat z_i]|^2}{\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\tilde\beta)\}^2 \hat z_i^2]},$$

where $\mathcal{A} = [\tilde\alpha - C\log^{-1} n, \tilde\alpha + C\log^{-1} n]$. Define the confidence regions with asymptotic coverage $1 - \xi$:

$$\mathcal{CR}_D = \{\alpha \in \mathbb{R} : |\alpha - \check\alpha| \le \hat\sigma_n \Phi^{-1}(1 - \xi/2)/\sqrt{n}\}$$

$$\mathcal{CR}_I = \{\alpha \in \mathcal{A} : nL_n(\alpha) \le (1 - \xi)\text{-quantile of } \chi^2(1)\}.$$

Table 1. The algorithm has three steps: (1) a preliminary estimator of the impact of the covariates, (2) estimation of residuals which are nearly orthogonal to the weighted covariates, and (3) use of the residuals as instruments to correct the preliminary estimator for the omitted variable bias. We assume the normalization $\mathbb{E}_n[x_{ij}^2] = 1$ and $\mathbb{E}_n[d_i^2] = 1$,
and Ai = ^v^<i>^^(l — 0.01/{pV n}) and A2 = n^/'^. The estimator of the variance is 
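given by $\hat\sigma_n^2 = \{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}\,\mathbb{E}_n[\{y_i - G(d_i\check\alpha + x_i'\tilde\beta)\}^2 \hat z_i^2]\,\{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}$.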
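Step 3 of Table 1 can be illustrated in isolation, taking the outputs of Steps 1 and 2 as given. The sketch below is an assumption-laden illustration, not the authors' code: it minimizes $L_n$ by a plain grid search, the names `step3`, `half_width` and `n_grid` are hypothetical, and `half_width` stands in for the radius of the search interval $\mathcal{A}$. It uses the fact that the $(1-\xi)$-quantile of $\chi^2(1)$ equals the square of the $(1-\xi/2)$-quantile of the standard normal.

```python
import numpy as np
from statistics import NormalDist

def G(t):
    """Logistic link, evaluated stably via tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(t, dtype=float)))

def L_n(alpha, y, d, xb, z):
    """L_n(alpha) = |E_n[r_i z_i]|^2 / E_n[r_i^2 z_i^2], r_i = y_i - G(d_i*alpha + xb_i)."""
    r = y - G(d * alpha + xb)
    return np.mean(r * z) ** 2 / np.mean(r ** 2 * z ** 2)

def step3(y, d, xb, z, w, alpha_init, xi=0.05, half_width=1.0, n_grid=2001):
    """Grid-search version of Step 3.

    xb: fitted index x_i'beta from Step 1; z: residual instrument from Step 2;
    w: estimated weights from Step 1; alpha_init: Step 1 estimate of alpha.
    """
    n = len(y)
    grid = np.linspace(alpha_init - half_width, alpha_init + half_width, n_grid)
    vals = np.array([L_n(a, y, d, xb, z) for a in grid])
    alpha_check = grid[np.argmin(vals)]

    # Plug-in variance: {E_n[w d z]}^{-1} E_n[r^2 z^2] {E_n[w d z]}^{-1}
    r = y - G(d * alpha_check + xb)
    sigma2 = np.mean(r ** 2 * z ** 2) / np.mean(w * d * z) ** 2

    q = NormalDist().inv_cdf(1.0 - xi / 2.0)
    half = np.sqrt(sigma2) * q / np.sqrt(n)
    cr_d = (alpha_check - half, alpha_check + half)  # normal-based region
    cr_i = grid[n * vals <= q ** 2]                   # grid points accepted by inversion
    return alpha_check, cr_d, cr_i
```

The region `cr_i` is returned as the set of accepted grid points rather than as an interval, since inverting the statistic need not produce a connected set in finite samples.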



Table 2 describes an alternative proposal to construct a confidence region, which is reminiscent of the double selection method proposed in [6] for partially linear models. The method replaces Step 3 in Table 1 with a logistic regression on all covariates selected in Steps 1 and 2. Although this approach might seem substantially different, it turns out that it uses the optimal instrument implicitly. In fact it can be seen as an iterated version of the previous method. The differences and connections between the two methods are discussed later in Section 5.1.
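The final step of the double selection proposal, refitting an unpenalized logistic regression on the treatment and the union of the two selected supports, can be sketched as follows. This is an illustrative sketch under stated assumptions: the Lasso estimates `beta_hat` and `theta_hat` are taken as given inputs, the function names are hypothetical, and the refit uses plain Newton iterations without step-size control, which may fail on separable data.

```python
import numpy as np

def G(t):
    """Logistic link, evaluated stably via tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(t, dtype=float)))

def logistic_refit(y, d, X, support, n_iter=50, tol=1e-10):
    """Unpenalized logistic regression of y on d and X[:, support] by Newton's method."""
    cols = np.array(sorted(support), dtype=int)
    Z = np.column_stack([d, X[:, cols]])
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        g = G(Z @ coef)
        grad = Z.T @ (g - y) / len(y)                      # gradient of Lambda
        hess = (Z * (g * (1.0 - g))[:, None]).T @ Z / len(y)  # Hessian of Lambda
        step = np.linalg.solve(hess, grad)
        coef -= step
        if np.max(np.abs(step)) < tol:
            break
    beta = np.zeros(X.shape[1])
    beta[cols] = coef[1:]
    return coef[0], beta

def double_selection_step3(y, d, X, beta_hat, theta_hat):
    """Refit on the union of the supports selected in Steps 1 and 2."""
    union = set(np.flatnonzero(beta_hat)) | set(np.flatnonzero(theta_hat))
    return logistic_refit(y, d, X, union)
```

Taking the union of the two supports is what lets a covariate missed in one selection step still enter the final regression if the other step picks it up.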

Comment 2.1 (Choice of Valid Instruments). An instrument $z_0$ is valid if $\mathrm{E}[w_i z_{0i} \mid x_i] = 0$ and $\mathrm{E}[w_i d_i z_{0i}] \neq 0$. We note that there are several valid instruments to choose from. The algorithm

Honest Confidence Region based on Double Selection

Step 1. Run Post-Lasso-Logistic of $y_i$ on $d_i$ and $x_i$:

$$(\hat\alpha, \hat\beta) \in \arg\min_{\alpha,\beta} \Lambda(\alpha,\beta) + \frac{\lambda_1}{n}\|(\alpha,\beta)\|_1$$

$$(\tilde\alpha, \tilde\beta) \in \arg\min_{\alpha,\beta} \Lambda(\alpha,\beta) : \mathrm{support}(\beta) \subseteq \mathrm{support}(\hat\beta)$$

Construct the weight $\hat w_i := G(d_i\tilde\alpha + x_i'\tilde\beta)\{1 - G(d_i\tilde\alpha + x_i'\tilde\beta)\}$, $i = 1,\ldots,n$.

Step 2. Run Lasso-OLS of $\sqrt{\hat w_i}\,d_i$ on $\sqrt{\hat w_i}\,x_i$:

$$\hat\theta \in \arg\min_{\theta} \mathbb{E}_n[\hat w_i(d_i - x_i'\theta)^2] + \frac{\lambda_2}{n}\|\theta\|_1$$

Step 3. Run Post-Lasso-Logistic of $y_i$ on $d_i$ and the covariates selected in Steps 1 and 2:

$$(\check\alpha, \check\beta) \in \arg\min_{\alpha,\beta} \Lambda(\alpha,\beta) : \mathrm{support}(\beta) \subseteq \mathrm{support}(\hat\beta) \cup \mathrm{support}(\hat\theta)$$

Define the confidence region with asymptotic coverage $1 - \xi$ as

$$\mathcal{CR}_D = \{\alpha \in \mathbb{R} : |\alpha - \check\alpha| \le \hat\sigma_n \Phi^{-1}(1 - \xi/2)/\sqrt{n}\}.$$

Table 2. The double selection algorithm has three steps: (1) select covariates based on the standard Lasso-Logistic regression, (2) select covariates based on the weighted treatment equation via Lasso, and (3) run a logistic regression with the treatment and all selected covariates. We assume the normalization $\mathbb{E}_n[x_{ij}^2] = 1$ and $\mathbb{E}_n[d_i^2] = 1$, and
Ai = M77^$-i(l - 0.01/{pV n}) and Aa = n^/a. 

stated in Table 1 uses $z_{0i} := v_i/\sqrt{w_i}$. This requires in Step 2 a Lasso method that is applied to the weighted equation (2.2). Since the weights $w_i$ can depend on the treatment $d_i$, the response variable in Step 2, this creates additional difficulties in the choice of penalty that are addressed in the appendix. Another valid choice of instrument is $z_{0i} := (d_i - \mathrm{E}[d_i \mid x_i])/w_i$. In this case, assuming $\mathrm{E}[d_i \mid x_i] = x_i'\theta_d$, we can estimate $z_{0i}$ by estimating $\theta_d$ via standard Lasso of $d_i$ on $x_i$, and estimating $w_i$ using the estimates of the $\ell_1$-logistic regression as in Step 1. Note that since no estimated weights are used in Lasso, more standard results on Lasso would be available to analyze the estimate of $\theta_d$. We further discuss the choice of instruments in Section 5.

Comment 2.2 (Alternative Implementations). As discussed before, the three-step approach proposed here can be implemented with several different methods, each with specific features. For instance, the Dantzig selector, square-root Lasso, or the associated post-model selection estimators could be used instead of Lasso or Post-Lasso. Moreover, the instrumental logistic regression can be substituted by a 1-step estimator starting from the $\ell_1$-penalized logistic estimator $\hat\alpha$, of the form
$$\check\alpha = \hat\alpha + \{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}\,\mathbb{E}_n[\{y_i - G(d_i\hat\alpha + x_i'\hat\beta)\}\hat z_i].$$
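The 1-step correction is a single Newton update of the instrumented moment condition, and can be sketched in a few lines. The function name `one_step` and its argument order are hypothetical, and the initial estimates, instrument and weights are assumed to come from earlier steps.

```python
import numpy as np

def G(t):
    """Logistic link, evaluated stably via tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(t, dtype=float)))

def one_step(alpha_hat, beta_hat, y, d, X, z, w):
    """alpha_hat + {E_n[w d z]}^{-1} E_n[{y - G(d*alpha_hat + x'beta_hat)} z]."""
    resid = y - G(d * alpha_hat + X @ beta_hat)
    return alpha_hat + np.mean(resid * z) / np.mean(w * d * z)
```

The update is the Newton step for solving $\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\hat\beta)\}\hat z_i] = 0$ in $\alpha$, because the derivative of that moment with respect to $\alpha$ is $-\mathbb{E}_n[w_i d_i \hat z_i]$ when the weights are evaluated at the current estimate.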
3. Main Result

3.1. Primitive Assumptions. In this section we list and discuss primitive conditions that allow us to derive our results. (Sharper high-level conditions are stated in the Appendix.) We consider the following quantities associated with the covariates in the sample $\tilde x_i = (d_i, x_i')'$, $i = 1,\ldots,n$. We denote its largest value as $K_n = \max_{i \le n}\|\tilde x_i\|_\infty$ and denote the minimum and maximum $m$-sparse eigenvalues associated with the sample by

$$\phi_{\min}(m) := \min_{1 \le \|\delta\|_0 \le m} \frac{\|\tilde x_i'\delta\|_{2,n}^2}{\|\delta\|^2} \quad \text{and} \quad \phi_{\max}(m) := \max_{1 \le \|\delta\|_0 \le m} \frac{\|\tilde x_i'\delta\|_{2,n}^2}{\|\delta\|^2}.$$
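For intuition, sparse eigenvalues can be computed by brute force when the number of covariates is small. The sketch below assumes the usual definition, the extreme values of $\|\tilde x_i'\delta\|_{2,n}^2/\|\delta\|^2$ over vectors with at most $m$ non-zero components, and the function name `sparse_eigenvalues` is hypothetical. It enumerates all supports of size exactly $m$, which suffices because the extreme eigenvalues of any smaller principal submatrix lie between those of a larger one containing it. The cost grows combinatorially in $p$, so this is only a diagnostic for toy designs.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(Xt, m):
    """Brute-force minimum and maximum m-sparse eigenvalues of the Gram matrix E_n[x x']."""
    n, p = Xt.shape
    gram = Xt.T @ Xt / n
    lo, hi = np.inf, -np.inf
    for T in combinations(range(p), m):
        # Eigenvalues of the principal submatrix indexed by T, in ascending order
        eig = np.linalg.eigvalsh(gram[np.ix_(T, T)])
        lo, hi = min(lo, eig[0]), max(hi, eig[-1])
    return lo, hi
```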

The following are sufficient primitive conditions. 

Condition L. (i) The data $\{(y_i,d_i,x_i) : i = 1,\ldots,n\}$ obey the model given by (2.1) and (2.2), and are independent across $i$. (ii) The weights satisfy $\min_{i \le n} w_i \ge c > 0$, and the following moment conditions hold: $0 < c \le \mathrm{E}[v_i^2 \mid x_i] \le \max_{i \le n}$ {E[?;^ | Xj]}-^/^ V {E[df \ Xj]}^/^ $\le C$. (iii) For some $\delta_n \to 0$ we have $c \le \phi_{\min}(s/\delta_n) \le \phi_{\max}(s/\delta_n) \le C$ with probability $1 - \Delta_n$.

Condition L(i) assumes independence across $i$ and the model described in Section 2. Condition L(ii) assumes that the conditional variance is bounded away from zero and imposes mild moment conditions. Condition L(iii) assumes that the sparse eigenvalues of size $s/\delta_n$ are well behaved. This is typically implied by mild conditions on the tails of the covariates and a growth condition, provided the underlying population design matrix has well-behaved sparse eigenvalues of the corresponding size.

3.2. Uniformly Valid Confidence Regions. In this section we state the main inferential result of the paper. We provide two uniformly valid confidence regions for the coefficient $\alpha_0$. In addition to the primitive conditions listed in the previous section, we also require some growth conditions on how fast the sparsity parameter $s$ and the total number of controls $p$ can grow relative to the sample size $n$.

Theorem 1 (Robust Inference, Optimal Instrument). Let $\{P_n\}$ be a sequence of data-generating processes. Assume Condition L holds for $P = P_n$ for each $n$. Then, provided that (K^ + K^s'^ + s^) log (p V n) ^ n6n, the estimator $\check\alpha$ based on the optimal instrument obeys, as $n \to \infty$,

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1),$$

where $\sigma_n^2 = \bar{\mathrm{E}}[v_i^2]^{-1}$. Moreover, we have that

$$nL_n(\alpha_0) \rightsquigarrow \chi^2(1)$$

and $\hat\sigma_n^2 = \{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}\,\mathbb{E}_n[\{y_i - G(d_i\check\alpha + x_i'\tilde\beta)\}^2 \hat z_i^2]\,\{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}$ is a consistent estimate for $\sigma_n^2$.

Theorem 1 establishes two asymptotic results that justify the confidence regions for $\alpha_0$ proposed in Table 1. The validity of $\mathcal{CR}_D$ is based on the asymptotic normality of the estimator $\check\alpha$, which yields confidence intervals with correct asymptotic coverage. The confidence region $\mathcal{CR}_I$ is based on inverting a statistic, exploiting the pivotality of $nL_n(\alpha_0)$. These results are achieved despite possible model selection mistakes.

The following result derives similar guarantees for the double selection estimator. 

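Theorem 2 (Robust Inference, Double Selection). Let $\{P_n\}$ be a sequence of data-generating processes. Assume Condition L holds for $P = P_n$ for each $n$. Then, provided that {K^ + K^s'^ + s^) log (p V n) ^ n6n, the double selection estimator $\check\alpha$ obeys, as $n \to \infty$,

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0, 1),$$

where $\sigma_n^2 = \bar{\mathrm{E}}[v_i^2]^{-1}$, and $\hat\sigma_n^2 = \{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}\,\mathbb{E}_n[\{y_i - G(d_i\check\alpha + x_i'\check\beta)\}^2 \hat z_i^2]\,\{\mathbb{E}_n[\hat w_i d_i \hat z_i]\}^{-1}$ is a consistent estimate for $\sigma_n^2$, with $\hat w_i = G(d_i\check\alpha + x_i'\check\beta)\{1 - G(d_i\check\alpha + x_i'\check\beta)\}$.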

The results achieved in Theorems 1 and 2 are valid uniformly over a large class of data-generating processes. The next corollary states this formally. Here $\mathcal{Q}_n$ denotes a collection of distributions for $\{(y_i,d_i,x_i')'\}_{i=1}^n$, and for $Q_n \in \mathcal{Q}_n$ the notation $\mathrm{P}_{Q_n}$ means that under $\mathrm{P}_{Q_n}$, $\{(y_i,d_i,x_i')'\}_{i=1}^n$ is distributed according to $Q_n$. (An analogous corollary also holds for the double selection method.)

Corollary 1 (Uniformly Valid Confidence Regions). Consider the confidence regions based on the optimal instrument and let $\mathcal{Q}_n$ be the collection of all distributions of $\{(y_i,d_i,x_i')'\}_{i=1}^n$ for which the conditions of Theorem 1 are satisfied for given $n \ge 1$. Then as $n \to \infty$, uniformly in $Q_n \in \mathcal{Q}_n$,

$$\mathrm{P}_{Q_n}(\alpha_0 \in \mathcal{CR}_D) \to 1 - \xi \quad \text{and} \quad \mathrm{P}_{Q_n}(\alpha_0 \in \mathcal{CR}_I) \to 1 - \xi.$$

Corollary 1 conveys the robustness of the approach. In particular, it allows the data-generating process to change with $n$. This result is new even under the traditional case of fixed-$p$ asymptotics. A characterization of the data-generating processes $\mathcal{Q}_n$ for which the uniformity
result holds is given by Condition L and the side condition {K^ + K^s"^ + s^) log'^(p V n) ^ n5„. 

Comment 3.1 (Choice of Valid Instruments and Minimax Efficiency). In the appendix we establish a more general result for any valid instrument under high-level conditions. Given any valid instrument $(z_{0i})_{i=1}^n$ we have
Therefore, the choice of instrument can be guided by efficiency considerations. Theorems 1 and 2 establish that the proposed estimators achieve the semiparametric (minimax) efficiency bound for the partially linear logistic regression (see page 356 in [10]).

3.3. Weakening Requirements for Testing $H_0 : \alpha_0 = 0$. In many applications the main goal is to test the null hypothesis $H_0 : \alpha_0 = 0$. Under $H_0$, the weights $w_i$ no longer depend on the treatment, which in turn leads to weaker conditions for the asymptotic consistency and normality of $\check\alpha$ under the null.

Specifically, when $\alpha_0 = 0$ we have

$$\mathrm{E}[\sqrt{w_i}\,v_i \mid x_i] = \sqrt{w_i}\,\mathrm{E}[v_i \mid x_i] = 0, \quad i = 1,\ldots,n.$$

In the logistic model associated with (2.1) we have $w_i > 0$, which makes the condition above equivalent to $\mathrm{E}[v_i \mid x_i] = 0$, $i = 1,\ldots,n$. Therefore, one can estimate $\theta_0$ in (2.2) without using weights, or assume $w_i$ to be a function of $x_i$ only and then estimate accordingly. Either way, the standard orthogonality condition required for Lasso and Post-Lasso will hold under the null, and therefore a standard choice of penalty parameter can be used.

To exploit this we need to modify the method proposed in Table 1. In Step 1 we can set $\hat w_i = G(x_i'\tilde\beta)\{1 - G(x_i'\tilde\beta)\}$, $i = 1,\ldots,n$, or set $\hat w_i = 1$, $i = 1,\ldots,n$. In Step 2, we use the penalty level and penalty loadings for heteroscedastic Lasso as in [6], for example $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - 0.1/(2p\log n))$ and penalty loadings $\hat\Gamma$ as in (C.34). In Step 3 we have two options. One possibility is to reject $H_0$ if $|\check\alpha|/\hat\sigma_n$ is large; the second possibility is to reject $H_0$ if $nL_n(0)$ is large. The next result shows the validity of this proposal.

Theorem 3. Let $\{P_n\}$ be a sequence of data-generating processes. Assume Condition L and the null hypothesis $H_0 : \alpha_0 = 0$ hold for $P = P_n$ for each $n$. Then, provided the growth requirement K^ \ogp + K'^s'^ log p ^ 5nn holds, we have

$$\sigma_n^{-1}\sqrt{n}\,\check\alpha \rightsquigarrow N(0, 1) \quad \text{and} \quad nL_n(0) \rightsquigarrow \chi^2(1).$$

Theorem 3 provides two ways of testing $H_0 : \alpha_0 = 0$ that are valid under weaker sparsity requirements than those of Theorem 1. As in the case of Theorem 1, this result does not rely on perfect model selection. This is fundamental for the uniformity properties of this procedure and for the robustness of the finite sample inference.

4. Monte Carlo

In this section we provide a simulation study to illustrate the finite sample performance of the proposed estimators. We compare their performance with that of the standard post-model selection estimator.

We consider the following regression model:

$$\mathrm{E}[y \mid d, x] = G(d\alpha_0 + x'(c_y\nu_y)), \quad d = x'(c_d\nu_d) + v,$$

where the vectors $\nu_y$ and $\nu_d$ are set to

$$\nu_y = (1, 1/2, 1/3, 1/4, 1/5, 0, 0, 0, 0, 0, 1, 1/2, 1/3, 1/4, 1/5, 0, 0, \ldots, 0)',$$
$$\nu_d = (1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9, 1/10, 0, 0, \ldots, 0)',$$

$x = (1, z')'$ consists of an intercept and covariates $z \sim N(0, \Sigma)$, and the error $v$ is i.i.d. $N(0, 1)$. The dimension $p$ of the covariates $x$ is 250, and the sample size $n$ is 200. The regressors are correlated with $\Sigma_{ij} = \rho^{|i-j|}$ and $\rho = 0.5$. The coefficient $c_d$ is used to control the $R^2$ of the reduced form equation; $c_y$ is set similarly.$^1$ In the simulations below we will use different values of $\alpha_0$, $c_y$ and $c_d$. For each repetition we draw new vectors $x_i$ and errors $v_i$.
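One draw from this design can be generated as follows. This is a sketch under a few assumptions flagged in the comments: the function name `simulate` and its defaults are hypothetical, the dimension $p = 250$ is taken to include the intercept (so $z$ has $p - 1$ components), and the default values of `c_y` and `c_d` are the ones quoted for the $R^2 = 0.75$ design.

```python
import numpy as np

def G(t):
    """Logistic link, evaluated stably via tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(t, dtype=float)))

def simulate(n=200, p=250, alpha0=0.5, c_y=1.43, c_d=0.59, rho=0.5, seed=0):
    """Draw (y, d, X) from the sparse design with decaying coefficients."""
    rng = np.random.default_rng(seed)
    k = p - 1  # assumption: x = (1, z')' has dimension p, so z has p - 1 entries
    idx = np.arange(k)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Sigma_ij = rho^|i-j|
    chol = np.linalg.cholesky(Sigma)
    z = rng.normal(size=(n, k)) @ chol.T                # rows are N(0, Sigma)
    X = np.column_stack([np.ones(n), z])

    decay = 1.0 / np.arange(1, 6)
    nu_y = np.zeros(p)
    nu_y[:5] = decay          # 1, 1/2, ..., 1/5
    nu_y[10:15] = decay       # five zeros, then 1, 1/2, ..., 1/5 again
    nu_d = np.zeros(p)
    nu_d[:10] = 1.0 / np.arange(1, 11)  # 1, 1/2, ..., 1/10

    d = X @ (c_d * nu_d) + rng.normal(size=n)
    prob = G(d * alpha0 + X @ (c_y * nu_y))
    y = (rng.uniform(size=n) < prob).astype(float)
    return y, d, X
```

Because the non-zero coefficients decay towards zero, scaling them by small `c_y` or `c_d` produces covariates that matter but are hard to select, which is the feature the study is designed to stress.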

The design above with $x'(c_y\nu_y)$ and $x'(c_d\nu_d)$ is a sparse model. However, as we vary the coefficients $c_y$ and $c_d$, the decay of the components rules out typical "separation from zero" assumptions on the coefficients of "important" covariates. Thus, we anticipate that inference procedures which rely on consistent model selection will not be robust in our simulation study.

Figure 1 considers the design above with $\alpha_0 = 0.5$ and $R^2 = 0.75$ in each equation, leading to $c_y = 1.43$ and $c_d = 0.59$, and it is based on 500 repetitions. The figure displays the studentized distribution of the estimator based on standard post-model selection logistic regression, where the model selection is based on $\ell_1$-penalized logistic regression, and the studentized distribution of the estimator defined in Algorithm 1. The performance of the proposed estimator confirms the asymptotic normality result established in Theorem 1. In contrast, the distribution of the standard post-model selection estimator appears to be a bimodal, non-standard distribution.

Figures 2-5 display the rejection probabilities of the standard post-model selection logistic regression and of the confidence regions associated with $\mathcal{CR}_D$, $\mathcal{CR}_I$ and the double selection $\mathcal{CR}_D$. For the latter we have two asymptotically valid ways of estimating its standard deviation. We compute both estimates and take the maximum. The figures display the results over 400 different designs where $\alpha_0 \in \{0, 0.1, 0.25, 0.5\}$ and the values of $c_y$ and $c_d$ are set to achieve $R^2 \in \{0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$ for each equation. There were 1000 replications for each of the 400 designs.

$^1$ For simplicity, the "$R^2$" associated with $c_y$ is computed as the $R^2$ in the equation $\tilde y = d\alpha_0 + x'(c_y\nu_y) + \epsilon$, where $\epsilon \sim N(0, 1)$ and $\tilde y$ is a real-valued auxiliary variable.

[Figure 1: two panels. Left: "Logistic Regression: post-single-selection estimator (studentized)". Right: "Logistic Regression: proposed estimator (studentized, robust)".]

Figure 1. The figures display the distribution of the standard post-model selection estimator (left panel) and the proposed estimator based on Algorithm 1 (right panel). The rejection probability at the 5% level of the standard single model selection procedure was 14.6%, and the rejection probability at the 5% level of the proposed estimator was 4.6%. The results are based on 500 replications.

The figures show the uniformity properties of these estimators. The performance of the standard method is consistent with the negative theoretical results in the literature. In particular, it seems very sensitive to the particular design, exhibiting a substantially larger rejection probability than the 5% level. The other three confidence regions seem substantially more robust. Such performance is predicted by the theoretical results in Section 3. As the impact of the treatment increases, the impact of using estimated weights in Step 2 seems to increase. This seems to be more problematic for the confidence region $\mathcal{CR}_D$ based on the optimal instrument. However, iterated versions of the method seem to achieve a very robust performance, as indicated by the confidence region based on double selection.



5. Discussion 

5.1. Connection between Double Selection and Optimal Instrument. In this section we provide a more formal connection between the two proposed methods. It turns out that the construction of the double selection estimator implicitly approximates the optimal instrument $z_{0i} = v_i/\sqrt{w_i}$. This occurs because the model selection procedure in Step 2 associated with (2.2) allows the estimator to achieve uniformity properties. To see that, let $T^*$ denote the variables selected in Steps 1 and 2, $T^* = \mathrm{support}(\hat\beta) \cup \mathrm{support}(\hat\theta)$. By the first-order conditions we have

$$\mathbb{E}_n[\{y_i - G(d_i\check\alpha + x_i'\check\beta)\}(d_i, x_{iT^*}')'] = 0,$$

creating an orthogonal relation to any linear combination of $(d_i, x_{iT^*}')'$. In particular, by taking the linear combination $(d_i, x_{iT^*}')(1, -\hat\theta_{T^*}')' = d_i - x_i'\hat\theta = \hat z_i$, we have

$$\mathbb{E}_n[\{y_i - G(d_i\check\alpha + x_i'\check\beta)\}\hat z_i] = 0.$$

Therefore the double selection estimator $\check\alpha$ minimizes

$$|\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\check\beta)\}\hat z_i]|^2,$$

where $\hat z_i$ is the instrument created by Step 2. Thus the double selection estimator can be seen as an iterated version of the method based on instruments where the Step 1 estimate $\tilde\beta$ is updated with $\check\beta$. Although their first-order asymptotics do not change, in finite samples the double selection approach allows the model selection in one step to potentially mitigate model selection mistakes of the other step. A potential drawback of double selection is that it works with slightly bigger models (and thus potentially slightly more stringent design conditions).

[Figure 2: four panels of rp(0.05) plotted against $R^2$ on $d$ and $R^2$ on $y$, for $\alpha_0 = 0$: "Standard Selection", "$\mathcal{CR}_D$", "$\mathcal{CR}_I$", and "Double Selection/Iterated $\mathcal{CR}_D$ with Robust s.e.".]

Figure 2. The figures display the rp(0.05) of the standard post-model selection estimator and the proposed confidence regions based on optimal instrument ($\mathcal{CR}_D$ and $\mathcal{CR}_I$) and double selection. There are a total of 100 different designs with $\alpha_0 = 0$. The results are based on 1000 replications for each design.

[Figure 3: four panels of rp(0.05) plotted against $R^2$ on $d$ and $R^2$ on $y$, for $\alpha_0 = 0.1$: "Standard Selection", "$\mathcal{CR}_D$", "$\mathcal{CR}_I$", and "Double Selection/Iterated $\mathcal{CR}_D$ with Robust s.e.".]

Figure 3. The figures display the rp(0.05) of the standard post-model selection estimator and the proposed confidence regions based on optimal instrument ($\mathcal{CR}_D$ and $\mathcal{CR}_I$) and double selection. There are a total of 100 different designs with $\alpha_0 = 0.1$. The results are based on 1000 replications for each design.
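The orthogonality argument of this subsection can be checked numerically: after an unpenalized logistic fit on the treatment and a set of selected covariates, the fitted residuals are orthogonal in sample to any linear combination of those regressors, including a residual of the form $d_i - x_i'\theta$ with $\theta$ supported on the selected set. The sketch below uses simulated data, and the selected set `T_star` and the vector `theta` are hypothetical stand-ins for the outputs of the selection steps.

```python
import numpy as np

def G(t):
    """Logistic link, evaluated stably via tanh."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(t, dtype=float)))

def newton_logit(y, Z, n_iter=50):
    """Unpenalized logistic fit of y on the columns of Z (plain Newton steps)."""
    coef = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        g = G(Z @ coef)
        hess = (Z * (g * (1.0 - g))[:, None]).T @ Z
        coef += np.linalg.solve(hess, Z.T @ (y - g))
    return coef

rng = np.random.default_rng(1)
n, p = 1000, 6
X = rng.normal(size=(n, p))
d = X[:, 1] + rng.normal(size=n)
y = (rng.uniform(size=n) < G(0.5 * d + X[:, 0])).astype(float)

T_star = [0, 1]                 # hypothetical union of the selected supports
theta = np.zeros(p)
theta[1] = 0.9                  # hypothetical Step 2 estimate, supported in T_star

Z = np.column_stack([d, X[:, T_star]])
coef = newton_logit(y, Z)
resid = y - G(Z @ coef)
z_hat = d - X @ theta           # a linear combination of the columns of Z
moment = np.mean(resid * z_hat) # zero up to numerical error
```

The moment vanishes whatever value `theta` takes on `T_star`, which is the sense in which the refit implicitly solves the instrumented estimating equation.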



5.2. Connection to Neymanization. Here we discuss connections between the proposed approach and Neyman's $C(\alpha)$ test ([15], [16]). For the sake of exposition we focus on the case where $\{(y_i, d_i, x_i)\}_{i=1}^n$ are independent and identically distributed, and assume the instruments are known. We consider the estimating equation for $\alpha_0$:

$$\mathrm{E}[\{y_i - G(d_i\alpha_0 + x_i'\beta_0)\}z_{0i}] = 0.$$

[Figure 4: four panels of rp(0.05) plotted against $R^2$ on $d$ and $R^2$ on $y$, for $\alpha_0 = 0.25$: "Standard Selection", "$\mathcal{CR}_D$", "$\mathcal{CR}_I$", and "Double Selection/Iterated $\mathcal{CR}_D$ with Robust s.e.".]

Figure 4. The figures display the rp(0.05) of the standard post-model selection estimator and the proposed confidence regions based on optimal instrument ($\mathcal{CR}_D$ and $\mathcal{CR}_I$) and double selection. There are a total of 100 different designs with $\alpha_0 = 0.25$. The results are based on 1000 replications for each design.



Our problem is to find a useful instrument $z_{0i}$ such that

$$\frac{\partial}{\partial\beta}\mathrm{E}[\{y_i - G(d_i\alpha_0 + x_i'\beta)\}z_{0i}]\Big|_{\beta=\beta_0} = 0.$$

Under this property, the estimator of $\alpha_0$ will be "immunized" against "crude" or nonregular estimation of $\beta_0$, for example via a post-selection procedure or a regularization procedure. Such immunization ideas are in fact behind Neyman's classical construction of his $C(\alpha)$ test, so we shall use the term "Neymanization" to describe such a procedure. Although there are many instruments $z_{0i}$ that can achieve the property stated above, the one proposed in Section 2 is optimal.
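A short derivation suggests why the validity condition $\mathrm{E}[w_i z_{0i} \mid x_i] = 0$ of Comment 2.1 delivers this property. It assumes that differentiation and expectation can be interchanged, and uses $G' = G(1 - G)$, so that the derivative of $G(d_i\alpha_0 + x_i'\beta)$ with respect to $\beta$ at $\beta_0$ equals $w_i x_i$:

$$\frac{\partial}{\partial\beta}\mathrm{E}[\{y_i - G(d_i\alpha_0 + x_i'\beta)\}z_{0i}]\Big|_{\beta=\beta_0} = -\mathrm{E}[w_i z_{0i} x_i] = -\mathrm{E}\big[\mathrm{E}[w_i z_{0i} \mid x_i]\,x_i\big] = 0.$$

The second equality is the law of iterated expectations, and the last one holds for any instrument whose weighted conditional mean given $x_i$ is zero.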



Instruments can be constructed by generalizing the weighted equation (2.2) to

$$f_i d_i = f_i m_0(x_i) + v_i, \quad \mathrm{E}[f_i v_i \mid x_i] = 0, \tag{5.3}$$

where $f_i = f(d_i, x_i)$ is a nonnegative weight, and setting the instrument as $z_{0i} := f_i v_i/w_i$. For example, $f_i = 1$ as proposed in Comment 2.1, or $f_i = \sqrt{w_i}$ as in (2.2). By construction, the function $m_0(x_i)$ solves the weighted least squares problem

$$\min_{h \in \mathcal{H}} \mathrm{E}[f_i^2\{d_i - h(x_i)\}^2], \tag{5.4}$$

where $\mathcal{H}$ denotes the measurable functions $h(x_i)$ such that $\mathrm{E}[f_i^2 h^2(x_i)] < \infty$. Our assumption is that $m_0(x_i)$ can be written as a sparse combination of $x_i$, namely $m_0(x_i) = x_i'\theta_0$ with $\|\theta_0\|_0 \le s$, so that

$$f_i d_i = f_i x_i'\theta_0 + v_i, \quad \mathrm{E}[f_i v_i \mid x_i] = 0. \tag{5.5}$$

[Figure 5: four panels of rp(0.05) plotted against $R^2$ on $d$ and $R^2$ on $y$, for $\alpha_0 = 0.5$: "Standard Selection", "$\mathcal{CR}_D$", "$\mathcal{CR}_I$", and "Double Selection/Iterated $\mathcal{CR}_D$ with Robust s.e.".]

Figure 5. The figures display the rp(0.05) of the standard post-model selection estimator and the proposed confidence regions based on optimal instrument ($\mathcal{CR}_D$ and $\mathcal{CR}_I$) and double selection. There are a total of 100 different designs with $\alpha_0 = 0.5$. The results are based on 1000 replications for each design.

In finite samples, the sparsity assumption allows us to employ Lasso or Post-Lasso to solve the least 
squares problem above approximately, and to estimate $z_{0i}$ as proposed in this work. Of course, 
other structural assumptions may motivate the use of other regularization methods. 
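To make the construction concrete, the following is a minimal sketch of a weighted Lasso fit for $\theta_0$ with $f_i = \sqrt{w_i}$, followed by the instrument $z_i = d_i - x_i'\theta$. It is an illustration under stated assumptions, not the procedure used in the paper: the function names (`weighted_lasso`, `instrument`), the plain coordinate-descent solver, the fixed number of sweeps, and the default unit loadings are all hypothetical choices made for the example. The objective minimized is $\sum_i w_i (d_i - x_i'\theta)^2 + \lambda \sum_j \gamma_j |\theta_j|$.

```python
def soft_threshold(b, t):
    """Soft-thresholding operator: sign(b) * max(|b| - t, 0)."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def weighted_lasso(x, d, w, lam, loadings=None, sweeps=200):
    """Coordinate descent for
        sum_i w_i * (d_i - x_i'theta)^2 + lam * sum_j loadings_j * |theta_j|.
    x is a list of rows, d and w are lists; returns theta as a list.
    (Illustrative solver; the number of sweeps is an arbitrary choice.)"""
    p = len(x[0])
    if loadings is None:
        loadings = [1.0] * p  # assumption: unit penalty loadings
    theta = [0.0] * p
    resid = list(d)  # current residuals d_i - x_i'theta
    for _ in range(sweeps):
        for j in range(p):
            a = sum(wi * xi[j] ** 2 for wi, xi in zip(w, x))
            if a == 0.0:
                continue
            # weighted correlation of column j with the partial residual
            b = sum(wi * xi[j] * (ri + xi[j] * theta[j])
                    for wi, xi, ri in zip(w, x, resid))
            new = soft_threshold(b, lam * loadings[j] / 2.0) / a
            delta = new - theta[j]
            if delta != 0.0:
                resid = [ri - xi[j] * delta for ri, xi in zip(resid, x)]
                theta[j] = new
    return theta


def instrument(x, d, theta):
    """Instrument z_i = d_i - x_i'theta (the case f_i = sqrt(w_i))."""
    return [di - sum(xij * tj for xij, tj in zip(xi, theta))
            for xi, di in zip(x, d)]
```

A Post-Lasso variant would refit by weighted least squares on the selected support; with `lam=0` the routine above reduces to weighted least squares on all columns.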
The proof of Lemma 1 shows that, for $\sqrt{n}(\alpha - \alpha_0) = O(1)$,

\[ \sqrt{n}\,\{\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\hat\beta)\} z_{0i}] - \mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\beta_0)\} z_{0i}]\} = o_P(1), \]

for $\hat\beta$ based on a sparse estimation procedure, despite the fact that $\hat\beta$ converges to $\beta_0$ at a slower 
rate than $1/\sqrt{n}$. That is, the empirical estimating equations behave as if $\beta_0$ is known. Hence 
for estimation we can use $\hat\alpha$ as a minimizer of the statistic

\[ L_n(\alpha) = \frac{|\sqrt{n}\,\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\hat\beta)\} z_{0i}]|^2}{\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\hat\beta)\}^2 z_{0i}^2]}. \]

Since $L_n(\alpha_0) \rightsquigarrow \chi^2(1)$, we can also use the statistic directly for testing hypotheses and for 
construction of confidence regions. 
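The test-inversion idea can be sketched in a few lines. The snippet below evaluates the statistic $L_n(\alpha)$ for given data and collects the grid points that are not rejected at a chosen level. It is only an illustration: the function names, the grid search, and the assumption that the fitted index $x_i'\hat\beta$ (passed as `xb`) and the instrument values `z` are already available are choices made for this example.

```python
import math
from statistics import NormalDist


def logistic(t):
    """Logistic link G(t) = exp(t) / (1 + exp(t)), evaluated stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def c_alpha_statistic(alpha, y, d, xb, z):
    """L_n(alpha) = n * E_n[eps_i z_i]^2 / E_n[eps_i^2 z_i^2],
    with eps_i = y_i - G(d_i * alpha + xb_i)."""
    n = len(y)
    scores = [(yi - logistic(di * alpha + xbi)) * zi
              for yi, di, xbi, zi in zip(y, d, xb, z)]
    mean_score = sum(scores) / n
    mean_square = sum(s * s for s in scores) / n
    return n * mean_score ** 2 / mean_square


def confidence_region(y, d, xb, z, grid, level=0.05, critical_value=None):
    """Grid points alpha with L_n(alpha) below the chi-square(1) critical value."""
    if critical_value is None:
        # the (1 - level) quantile of chi-square(1) is a squared normal quantile
        critical_value = NormalDist().inv_cdf(1.0 - level / 2.0) ** 2
    return [a for a in grid
            if c_alpha_statistic(a, y, d, xb, z) <= critical_value]
```

The region need not be an interval in finite samples, which is why the sketch returns the accepted grid points rather than two endpoints.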

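This is in fact a version of Neyman's $C(\alpha)$ statistic. To see that, note that by (5.5) we have 
$\theta_0 = \mathrm{E}[f_i^2 x_i x_i']^{-}\mathrm{E}[f_i^2 d_i x_i]$, where $A^{-}$ denotes a generalized inverse of $A$, and define

\[ z_{0i} = f_i v_i / w_i = (f_i^2/w_i) d_i - (f_i^2/w_i) x_i' \mathrm{E}[f_i^2 x_i x_i']^{-}\mathrm{E}[f_i^2 d_i x_i] \quad \text{and} \quad \hat\varepsilon_i(\alpha) = y_i - G(d_i\alpha + x_i'\hat\beta). \]

Thus we obtain a familiar form of the $C(\alpha)$ statistic. 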

Let $\hat\alpha$ be the estimator that minimizes $L_n$ up to $o_P(1)$, so that, under suitable regularity conditions,

\[ \sigma_n^{-1}\sqrt{n}(\hat\alpha - \alpha_0) \rightsquigarrow N(0, 1), \quad \sigma_n^2 = \mathrm{E}[w_i d_i z_{0i}]^{-2}\,\mathrm{E}[w_i z_{0i}^2]. \]

It is easy to show that the smallest value of $\sigma_n^2$ is achieved by setting $f_i = \sqrt{w_i}$, so that $z_{0i} = v_i/\sqrt{w_i}$ and

\[ \sigma_n^2 = \mathrm{E}[v_i^2]^{-1}. \tag{5.6} \]

Thus, setting $f_i = \sqrt{w_i}$ gives an optimal instrument amongst all "immunizing" instruments 
generated by the process described above. This improvement translates into shorter 
confidence intervals and better testing based on either $\hat\alpha$ or $L_n$. While $f_i = \sqrt{w_i}$ is optimal, 
$w_i$ will have to be estimated in practice, which actually results in more stringent conditions than 
when using known weights. This motivates the search for different weights that might have less 
stringent requirements. For example, under the limited heteroscedasticity of the model when 
$H_0 : \alpha_0 = 0$ holds, Theorem 3 reduces the requirements. 
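As a check on (5.6), the variance formula can be worked out directly from (5.5); the short derivation below uses only the definitions of this section and is offered as a sketch. For a general weight $f_i$, with $z_{0i} = f_i v_i / w_i$ and $\mathrm{E}[f_i v_i \mid x_i] = 0$,

\[ \mathrm{E}[w_i d_i z_{0i}] = \mathrm{E}[f_i d_i v_i] = \mathrm{E}[(f_i x_i'\theta_0 + v_i) v_i] = \mathrm{E}[v_i^2], \qquad \mathrm{E}[w_i z_{0i}^2] = \mathrm{E}[f_i^2 v_i^2 / w_i], \]

so that $\sigma_n^2 = \mathrm{E}[f_i^2 v_i^2/w_i] / \mathrm{E}[v_i^2]^2$. With $f_i = \sqrt{w_i}$ the numerator becomes $\mathrm{E}[v_i^2]$, which gives

\[ \sigma_n^2 = \mathrm{E}[v_i^2] / \mathrm{E}[v_i^2]^2 = \mathrm{E}[v_i^2]^{-1}. \]

Note that $v_i$ itself depends on the choice of $f_i$ through (5.5), so this computation verifies the value in (5.6) but does not by itself compare different weights.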
5.3. Minimax Efficiency. There is also a connection to the (local) minimax efficiency analysis 
from semiparametric efficiency theory. [10] derives an efficient score function for the 
partially linear logistic regression model:

\[ S_i = \{y_i - G(d_i\alpha_0 + x_i'\beta_0)\}\{d_i - m_0(x_i)\}, \]

where $m_0(x_i)$ is the function $m_0(x_i)$ in (5.3) induced by the weight $f_i = \sqrt{w_i}$:

\[ m_0(x_i) = \frac{\mathrm{E}[w_i d_i \mid x_i]}{\mathrm{E}[w_i \mid x_i]}. \]

Using the assumption $m_0(x_i) = x_i'\theta_0$, where $\theta_0$ is sparse with $\|\theta_0\|_0 \le s \ll n$, we have that

\[ S_i = \{y_i - G(d_i\alpha_0 + x_i'\beta_0)\}\, v_i/\sqrt{w_i}, \]

which is the score that was constructed using Neymanization. It follows that the estimator based 
on the instrument $z_{0i} = v_i/\sqrt{w_i}$ is actually efficient in the minimax sense (see Theorem 18.4 in 
[10]), and inference about $\alpha_0$ based on this estimator provides best minimax power against local 
alternatives (see Theorem 18.12 in [10]). 
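The two displays above can be linked by a short calculation, sketched here from the definitions in (5.3) and (5.4) only. The first-order condition of the weighted least squares problem (5.4) is $\mathrm{E}[f_i^2\{d_i - m_0(x_i)\} \mid x_i] = 0$, so

\[ m_0(x_i) = \frac{\mathrm{E}[f_i^2 d_i \mid x_i]}{\mathrm{E}[f_i^2 \mid x_i]}, \]

and setting $f_i^2 = w_i$ gives the stated ratio $\mathrm{E}[w_i d_i \mid x_i]/\mathrm{E}[w_i \mid x_i]$. Moreover, with $f_i = \sqrt{w_i}$ equation (5.3) reads $\sqrt{w_i}\,d_i = \sqrt{w_i}\,m_0(x_i) + v_i$, hence

\[ d_i - m_0(x_i) = v_i/\sqrt{w_i} = z_{0i}, \]

which is why the efficient score $S_i$ coincides with the score built from the optimal instrument.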



The claim above is formal as long as, given a law $Q_n$, least favorable submodels are permitted 
as deviations that lie within the overall model. Specifically, given a law $Q_n$, we shall need to 
allow for a certain neighborhood $\mathcal{Q}_n^\delta$ of $Q_n$ such that $Q_n \in \mathcal{Q}_n^\delta \subset \mathcal{Q}_n$, where the overall model 
$\mathcal{Q}_n$ is defined similarly as before. To allow for this we consider a collection of models indexed 
by a parameter $t = (t_1, t_2)'$:

\[ \mathrm{E}[y_i \mid d_i, x_i] = G(d_i\{\alpha_0 + t_1\} + x_i'\{\beta_0 + t_2\theta_0\}), \quad \|t\| \le \delta, \tag{5.7} \]

\[ \sqrt{w_i}\,d_i = \sqrt{w_i}\,x_i'\theta_0 + v_i, \quad \mathrm{E}[\sqrt{w_i}\,v_i \mid x_i] = 0, \tag{5.8} \]

where $\|\beta_0\|_0 \vee \|\theta_0\|_0 \le s/2$ and conditions as in Section 3 hold. The case with $t = 0$ generates the 
model $Q_n$; by varying $t$ within a $\delta$-ball, we generate models $\mathcal{Q}_n^\delta$ containing the least favorable 
deviations. By [10], the efficient score for the model given above is $S_i$, so we cannot have a 
better regular estimator than the estimator whose influence function is $S_i$. Since our model 
$\mathcal{Q}_n$ contains $\mathcal{Q}_n^\delta$, all the formal conclusions about (local minimax) optimality of our estimators 
hold from the theorems cited above (using subsequence arguments to handle models changing with 
$n$). Our estimators are regular, since under $Q_n^t$ with $t = (O(1/\sqrt{n}), O(1))$, their first order 
asymptotics do not change, as a consequence of the theorems in Section 3. 
Appendix A. Generic Instrumental Logistic Regression with Estimated Data 

Let $(d, x) \in \mathcal{D} \times \mathcal{X}$. In this section, for $h = (\beta, z)$, where $z$ is a function $(d, x) \mapsto z(d, x)$, we 
write

\[ \psi_{\alpha,h}(y_i, d_i, x_i) = \psi_{\alpha,\beta,z}(y_i, d_i, x_i) = \{y_i - G(x_i'\beta + d_i\alpha)\} z(d_i, x_i) = \{y_i - G(x_i'\beta + d_i\alpha)\} z_i. \]

For a fixed $\alpha \in \mathbb{R}$, $\beta \in \mathbb{R}^p$, and $z : \mathcal{D} \times \mathcal{X} \to \mathbb{R}$ we define

\[ \Gamma(\alpha, h) := \mathrm{E}[\psi_{\alpha,h}(y_i, d_i, x_i)]. \]

For notational convenience we let $z_i = z(d_i, x_i)$, $h_0 = (\beta_0, z_0)$ and $\hat h = (\hat\beta, \hat z)$. The partial 
derivative of $\Gamma$ with respect to $\alpha$ at $(\alpha, h)$ is denoted by $\Gamma_1(\alpha, h)$ and the directional derivative 
with respect to $[\hat h - h_0]$ at $(\alpha, h)$ is denoted by $\Gamma_2(\alpha, h)[\hat h - h_0]$. 

We assume that the estimated vector (3 and the estimated function z satisfy the following 
condition. 

Condition ILOG. We assume that for some sequences $\delta_n \to 0$ and $\Delta_n \to 0$, with probability 
at least $1 - \Delta_n$: 

(i) $\{\alpha : |\alpha - \alpha_0| \le n^{-1/2}/\delta_n\} \subset \mathcal{A}$, where $\mathcal{A}$ is a (possibly random) compact interval; 
(ii) E[wiZoi \xi] = 0,0<c^ \E[widiZoi]\, and maxi^„{E[(zoi/tfi)^]}i/2 ^ {E[df]}V2 ^ c. 
(iii) the estimated quantities h = {f3,z) 

max{l + E[\zi - zo^\]y/^\\x',0 - Mh,n ^ 5nn-l/^ 

*<" _ (A.9) 

{E{{% - zoif]V/^ ^ 6n, K0-I3o)h,n • {E[{% - zoif]V/^ ^ 5„n-V2, 



sup 



(E„-E) ip^-f^{yi,di,Xi) -ipa,hoiyi^di,Xi) ^ (5„ n ^/^ (A. 10) 

\a-ao\i^Sn and E„[V'^j^(yj, (ij,a;j)]| ^ (5„ n~^/^. (A. 11) 

(iv) 11(1 V \di\){zi - zoi)\\2,n < K and ||{a:'.(/3 - /3o)}2||2,„ ^ 5,,. 

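Lemma 1. Under Condition ILOG(i,ii,iii) we have

\[ \{\mathrm{E}[w_i d_i z_{0i}]^{-1}\,\mathrm{E}[w_i z_{0i}^2]\,\mathrm{E}[w_i d_i z_{0i}]^{-1}\}^{-1/2}\sqrt{n}(\hat\alpha - \alpha_0) \rightsquigarrow N(0, 1). \]

Moreover, if additionally ILOG(iv) holds we have

\[ n L_n(\alpha_0) \rightsquigarrow \chi^2(1) \]

and the variance estimator is consistent, namely

\[ \mathbb{E}_n[\hat w_i d_i \hat z_i]^{-1}\,\mathbb{E}_n[\{y_i - G(x_i'\hat\beta + d_i\hat\alpha)\}^2 \hat z_i^2]\,\mathbb{E}_n[\hat w_i d_i \hat z_i]^{-1} \to_P \mathrm{E}[w_i d_i z_{0i}]^{-1}\,\mathrm{E}[w_i z_{0i}^2]\,\mathrm{E}[w_i d_i z_{0i}]^{-1}. \]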
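Proof of Lemma 1. In Steps 1-4 we use ILOG(i)-(iii). In Steps 5 and 6 we will also use ILOG(iv). 

Step 1. (Main Step for Normality) We have

\[ \underbrace{\mathbb{E}_n[\psi_{\hat\alpha,\hat h}(y_i,d_i,x_i)]}_{(0)} = \mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] + \mathbb{E}_n[\psi_{\hat\alpha,\hat h}(y_i,d_i,x_i) - \psi_{\alpha_0,h_0}(y_i,d_i,x_i)] \]

\[ = \underbrace{\mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)]}_{(I)} + \underbrace{\Gamma(\hat\alpha,\hat h)}_{(II)} + \underbrace{n^{-1/2}\mathbb{G}_n(\psi_{\hat\alpha,\hat h} - \psi_{\hat\alpha,h_0})}_{(III)} + \underbrace{n^{-1/2}\mathbb{G}_n(\psi_{\hat\alpha,h_0} - \psi_{\alpha_0,h_0})}_{(IV)}. \]

By Condition ILOG(iii), (A.11), with probability at least $1 - \Delta_n$ we have $|(0)| \lesssim \delta_n n^{-1/2}$. 
By Step 2 below we have $|(II) + \mathrm{E}[w_i d_i z_{0i}](\hat\alpha - \alpha_0)| \lesssim_P \delta_n n^{-1/2} + \delta_n|\hat\alpha - \alpha_0|$. 
By Condition ILOG(iii), (A.10), with probability at least $1 - \Delta_n$ we have $|(III)| \lesssim \delta_n n^{-1/2}$. 

To control $(IV)$ note that

\[ |\psi_{\alpha,h_0}(y_i,d_i,x_i) - \psi_{\alpha_0,h_0}(y_i,d_i,x_i)| \le |G(d_i\alpha + x_i'\beta_0) - G(d_i\alpha_0 + x_i'\beta_0)| \cdot |z_{0i}| \le |\alpha - \alpha_0| \cdot |d_i z_{0i}|. \]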

By Condition ILOG(iii), (jAJJJ , we have |d-ao| < KsothatE[{ipa,hoiyi,di,Xi)-'ipao,hoiyi^di,Xi)}'^] ^ 
\a — ao\'^E[df Zq^] and using a version of Theorem 2.14.1 in |2D] we have 

sup E„[{V'a,/io(yi. di,Xi) - tpaoMiVi^ di,Xi)}^] ^ dlEnidjz^i] < dlE[djzQi] 

with probability 1 — A„ from concentration of measure and Condition ILOG(ii). These relations 
and the maximal inequality in Lemma [U we have 



{I'^) ^P sup n ^^"^GnilpaM - i'aoM) 

\a-ao\!^5n 

<P n"^/^ sup \a - ao\E[dfzQi\ < dnU'^^'^ 

\a-ao\^Sn 

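Combining the bounds for $(0)$, $(II)$-$(IV)$ above we have

\[ \mathrm{E}[w_i d_i z_{0i}](\hat\alpha - \alpha_0) = \mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] + O_P(\delta_n n^{-1/2}) + O_P(\delta_n)|\hat\alpha - \alpha_0|. \]

Since $\mathrm{E}[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] = 0$ and $\mathrm{E}[w_i z_{0i}^2] \le C$, by the Lyapunov CLT we have

\[ \sqrt{n}\,(I) = \sqrt{n}\,\mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] \rightsquigarrow N(0, \mathrm{E}[w_i z_{0i}^2]) \]

and the first assertion follows by noting that $|\mathrm{E}[w_i d_i z_{0i}]| \ge c > 0$. 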

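Step 2. (Bounding $\Gamma(\alpha,\hat h)$ for $|\alpha - \alpha_0| \le \delta_n$, which covers $(II)$) We have

\[ \Gamma(\alpha,\hat h) = \Gamma(\alpha,h_0) + \Gamma(\alpha,\hat h) - \Gamma(\alpha,h_0) \tag{A.12} \]

\[ = \Gamma(\alpha,h_0) + \{\Gamma(\alpha,\hat h) - \Gamma(\alpha,h_0) - \Gamma_2(\alpha,h_0)[\hat h - h_0]\} + \Gamma_2(\alpha,h_0)[\hat h - h_0]. \tag{A.13} \]

Because $\Gamma(\alpha_0,h_0) = 0$, by Taylor expansion there is some $\bar\alpha \in [\alpha_0, \alpha]$ such that

\[ \Gamma(\alpha,h_0) = \Gamma_1(\bar\alpha,h_0)(\alpha - \alpha_0) = \{\Gamma_1(\alpha_0,h_0) + \eta_n\}(\alpha - \alpha_0), \]

where $|\eta_n| \le \delta_n \mathrm{E}[|d_i^2 z_{0i}|]$ by relation (A.20) in Step 4. 

Combining the argument above with relations (A.15), (A.16) and (A.18) in Step 3 below we 
have

\[ \Gamma(\alpha,\hat h) = \Gamma_2(\alpha_0,h_0)[\hat h - h_0] + \Gamma(\alpha_0,h_0) + \{\Gamma_1(\alpha_0,h_0) + O(\delta_n \mathrm{E}[|d_i^2 z_{0i}|])\}(\alpha - \alpha_0) + O(\delta_n n^{-1/2}) \]

\[ = \Gamma_1(\alpha_0,h_0)(\alpha - \alpha_0) + O(\delta_n|\alpha - \alpha_0|\,\mathrm{E}[|d_i^2 z_{0i}|] + \delta_n n^{-1/2}). \tag{A.14} \]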

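Step 3. (Relations for $\Gamma_2$) The directional derivative $\Gamma_2$ with respect to the direction $\hat h - h_0$ at 
a point $h = (\beta, z)$ is given by

\[ \Gamma_2(\alpha,h)[\hat h - h_0] = -\mathrm{E}[G'(d_i\alpha + x_i'\beta)\, z_i\, x_i'\{\hat\beta - \beta_0\}] + \mathrm{E}[\{G(d_i\alpha_0 + x_i'\beta_0) - G(d_i\alpha + x_i'\beta)\}\{\hat z_i - z_{0i}\}]. \]

Note that when $\Gamma_2$ is evaluated at $(\alpha_0, h_0)$ we have

\[ \Gamma_2(\alpha_0,h_0)[\hat h - h_0] = -\mathrm{E}[w_i z_{0i}\, x_i'(\hat\beta - \beta_0)] = 0 \tag{A.15} \]

because of the orthogonality condition $\mathrm{E}[w_i z_{0i} \mid x_i] = 0$ in Condition ILOG(ii), and by definition 
$w_i = G(x_i'\beta_0 + \alpha_0 d_i)\{1 - G(x_i'\beta_0 + \alpha_0 d_i)\}$. In addition, the expression for $\Gamma_2$ leads to the following 
bound

\[ |\Gamma_2(\alpha,h_0)[\hat h - h_0] - \Gamma_2(\alpha_0,h_0)[\hat h - h_0]| \le \mathrm{E}[|\alpha - \alpha_0|\,|d_i z_{0i}|\,|x_i'\{\hat\beta - \beta_0\}|] + \mathrm{E}[|(\alpha - \alpha_0) d_i|\,|\hat z_i - z_{0i}|] \]

\[ \le |\alpha - \alpha_0| \cdot \|x_i'\{\hat\beta - \beta_0\}\|_{2,n}\{\mathrm{E}[z_{0i}^2 d_i^2]\}^{1/2} + |\alpha - \alpha_0| \cdot \{\mathrm{E}[(\hat z_i - z_{0i})^2]\}^{1/2}\{\mathrm{E}[d_i^2]\}^{1/2} \lesssim |\alpha - \alpha_0|\,\delta_n. \tag{A.16} \]

To bound the second derivative, recall that for $G(t) = \exp(t)/\{1 + \exp(t)\}$ the derivatives 
$G'(t) = G(t)[1 - G(t)]$ and $G''(t) = G(t)[1 - G(t)][1 - 2G(t)]$ are all less than 1 in absolute value. The second 
directional derivative $\Gamma_{22}$ at $h = (\beta, z)$ with respect to the direction $\hat h - h_0$ can be bounded by

\[ \Gamma_{22}(\alpha,h)[\hat h - h_0, \hat h - h_0] = -\mathrm{E}[G''(x_i'\beta + \alpha d_i)\, z_i\, \{x_i'(\hat\beta - \beta_0)\}^2] - 2\mathrm{E}[G'(x_i'\beta + d_i\alpha)\{x_i'(\hat\beta - \beta_0)\}(\hat z_i - z_{0i})] \tag{A.17} \]

\[ \le \max_{i\le n}\mathrm{E}[|z_i|]\,\|x_i'(\hat\beta - \beta_0)\|_{2,n}^2 + 2\|x_i'(\hat\beta - \beta_0)\|_{2,n}\{\mathrm{E}[(\hat z_i - z_{0i})^2]\}^{1/2}. \]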
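Therefore

\[ |\Gamma(\alpha,\hat h) - \Gamma(\alpha,h_0) - \Gamma_2(\alpha,h_0)[\hat h - h_0]| \le \sup_{h \in [h_0,\hat h]} |\Gamma_{22}(\alpha,h)[\hat h - h_0, \hat h - h_0]| \]

\[ \le \Big(\max_{i\le n}\mathrm{E}[|z_{0i}|] + \mathrm{E}[|\hat z_i - z_{0i}|]\Big)\|x_i'(\hat\beta - \beta_0)\|_{2,n}^2 + 2\|x_i'\{\hat\beta - \beta_0\}\|_{2,n}\{\mathrm{E}[(\hat z_i - z_{0i})^2]\}^{1/2} \lesssim \delta_n n^{-1/2}, \tag{A.18} \]

where $\max_{i\le n}\mathrm{E}[|z_{0i}|]$ is uniformly bounded by ILOG(ii) and the last relation is assumed in 
Condition ILOG(iii). 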
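The closed forms for $G'$ and $G''$ used in Step 3 are easy to check numerically. The snippet below is a small sanity check, not part of the proof: it compares the formulas with central finite differences and records the largest absolute values on a grid. The grid, the step size, and the function names are arbitrary choices for the illustration.

```python
import math


def G(t):
    """Logistic link G(t) = exp(t) / (1 + exp(t)), evaluated stably."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def G1(t):
    """First derivative: G'(t) = G(t) * (1 - G(t))."""
    g = G(t)
    return g * (1.0 - g)


def G2(t):
    """Second derivative: G''(t) = G(t) * (1 - G(t)) * (1 - 2 G(t))."""
    g = G(t)
    return g * (1.0 - g) * (1.0 - 2.0 * g)


def central_difference(f, t, h=1e-5):
    """Symmetric finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


# Largest absolute values of G' and G'' on a grid over [-8, 8].
grid = [k / 10.0 for k in range(-80, 81)]
max_abs_G1 = max(abs(G1(t)) for t in grid)
max_abs_G2 = max(abs(G2(t)) for t in grid)
```

The bound by 1 used in the text is conservative: $G'$ attains its maximum $1/4$ at $t = 0$, and $|G''|$ never exceeds $1/(6\sqrt{3})$.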

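Step 4. (Relations for $\Gamma_1$) By definition of $\Gamma$, its derivative with respect to $\alpha$ at $(\alpha, h)$ is

\[ \Gamma_1(\alpha,h) = -\mathrm{E}[G'(x_i'\beta + \alpha d_i)\, z_i d_i]. \]

Therefore, when the function above is evaluated at $\alpha = \alpha_0$ and $h = h_0 = (\beta_0, z_0)$, since 
$G'(x_i'\beta_0 + \alpha_0 d_i) = w_i$ we have

\[ \Gamma_1(\alpha_0,h_0) = -\mathrm{E}[w_i d_i z_{0i}]. \tag{A.19} \]

Moreover, $\Gamma_1$ also satisfies

\[ |\Gamma_1(\alpha,h_0) - \Gamma_1(\alpha_0,h_0)| = |\mathrm{E}[G'(x_i'\beta_0 + \alpha d_i) z_{0i} d_i] - \mathrm{E}[G'(x_i'\beta_0 + \alpha_0 d_i) z_{0i} d_i]| \tag{A.20} \]

\[ \le |\alpha - \alpha_0|\,\mathrm{E}[|d_i^2 z_{0i}|]. \tag{A.21} \]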



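Step 5. (Estimation of Variance) First note that

\[ |\mathbb{E}_n[\hat w_i d_i \hat z_i] - \mathrm{E}[w_i d_i z_{0i}]| \]

\[ \le |\mathbb{E}_n[\hat w_i d_i \hat z_i] - \mathbb{E}_n[w_i d_i z_{0i}]| + |\mathbb{E}_n[w_i d_i z_{0i}] - \mathrm{E}[w_i d_i z_{0i}]| \]

\[ \le |\mathbb{E}_n[(\hat w_i - w_i) d_i \hat z_i]| + |\mathbb{E}_n[w_i d_i(\hat z_i - z_{0i})]| + |\mathbb{E}_n[w_i d_i z_{0i}] - \mathrm{E}[w_i d_i z_{0i}]| \]

\[ \le |\mathbb{E}_n[(\hat w_i - w_i) d_i(\hat z_i - z_{0i})]| + |\mathbb{E}_n[(\hat w_i - w_i) d_i z_{0i}]| + \|w_i d_i\|_{2,n}\|\hat z_i - z_{0i}\|_{2,n} + |\mathbb{E}_n[w_i d_i z_{0i}] - \mathrm{E}[w_i d_i z_{0i}]| \]

\[ \lesssim_P \|(\hat w_i - w_i) d_i\|_{2,n}\|\hat z_i - z_{0i}\|_{2,n} + \|\hat w_i - w_i\|_{2,n}\|d_i z_{0i}\|_{2,n} + \|w_i d_i\|_{2,n}\|\hat z_i - z_{0i}\|_{2,n} + |\mathbb{E}_n[w_i d_i z_{0i}] - \mathrm{E}[w_i d_i z_{0i}]| \]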



because ^ Wi,Wi ^ 1, E[df] ^ C, E[z^.] ^ C by Condition ILOG(ii) and Conditions ILOG(iii) 
and (iv). 
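Next we proceed to control the other term of the variance. Since $|\psi_{\hat\alpha,\hat h}(y_i,d_i,x_i) - \psi_{\alpha_0,\hat h}(y_i,d_i,x_i)| \le |d_i(\hat\alpha - \alpha_0)\hat z_i|$ 
and $|\psi_{\alpha_0,\hat h}(y_i,d_i,x_i) - \psi_{\alpha_0,h_0}(y_i,d_i,x_i)| \le |\hat z_i - z_{0i}| + |x_i'\{\hat\beta - \beta_0\} z_{0i}|$ we have

\[ \big|\,\|\psi_{\hat\alpha,\hat h}(y_i,d_i,x_i)\|_{2,n} - \|\psi_{\alpha_0,h_0}(y_i,d_i,x_i)\|_{2,n}\big| \le \|d_i(\hat\alpha - \alpha_0)\hat z_i\|_{2,n} + \|\hat z_i - z_{0i}\|_{2,n} + \|x_i'\{\hat\beta - \beta_0\} z_{0i}\|_{2,n} \]

\[ \le |\hat\alpha - \alpha_0|\,\|d_i z_{0i}\|_{2,n} + |\hat\alpha - \alpha_0|\,\|d_i(\hat z_i - z_{0i})\|_{2,n} + \|\hat z_i - z_{0i}\|_{2,n} + \|x_i'\{\hat\beta - \beta_0\} z_{0i}\|_{2,n} \tag{A.22} \]

by ILOG(ii) and ILOG(iv). Also, $|\mathbb{E}_n[\psi^2_{\alpha_0,h_0}(y_i,d_i,x_i)] - \mathrm{E}[\psi^2_{\alpha_0,h_0}(y_i,d_i,x_i)]| \lesssim_P \delta_n$ by independence 
and the bounded moment $\mathrm{E}[w_i z_{0i}^4] \le \mathrm{E}[z_{0i}^4]$ from Condition ILOG(ii). 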

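Step 6. (Main Step for $\chi^2$) Note that the denominator of $L_n(\alpha_0)$ was analyzed in relation 
(A.22) of Step 5. Next consider the numerator of $L_n(\alpha_0)$. Since $\Gamma(\alpha_0,h_0) = \mathrm{E}[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] = 0$ 
we have

\[ \mathbb{E}_n[\psi_{\alpha_0,\hat h}(y_i,d_i,x_i)] = (\mathbb{E}_n - \mathrm{E})[\psi_{\alpha_0,\hat h}(y_i,d_i,x_i) - \psi_{\alpha_0,h_0}(y_i,d_i,x_i)] + \Gamma(\alpha_0,\hat h) + \mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)]. \]

By Condition ILOG(iii) and (A.14) with $\alpha = \alpha_0$, it follows that

\[ |(\mathbb{E}_n - \mathrm{E})[\psi_{\alpha_0,\hat h}(y_i,d_i,x_i) - \psi_{\alpha_0,h_0}(y_i,d_i,x_i)]| \le \delta_n n^{-1/2} \quad \text{and} \quad |\Gamma(\alpha_0,\hat h)| \lesssim_P \delta_n n^{-1/2}. \]

Therefore, using that $nA_n^2 = nB_n^2 + n(A_n - B_n)^2 + 2nB_n(A_n - B_n)$, for $A_n = \mathbb{E}_n[\psi_{\alpha_0,\hat h}(y_i,d_i,x_i)]$ 
and $B_n = \mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] \lesssim_P \{\mathrm{E}[w_i z_{0i}^2]\}^{1/2} n^{-1/2}$, we have

\[ nL_n(\alpha_0) = \frac{n|\mathbb{E}_n[\psi_{\alpha_0,\hat h}(y_i,d_i,x_i)]|^2}{\mathbb{E}_n[\psi^2_{\alpha_0,\hat h}(y_i,d_i,x_i)]} = \frac{n|\mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)]|^2 + O_P(\delta_n)}{\mathrm{E}[w_i z_{0i}^2] + O_P(\delta_n)} = \frac{n|\mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)]|^2}{\mathrm{E}[w_i z_{0i}^2]} + O_P(\delta_n), \]

since $\mathrm{E}[w_i z_{0i}^2]$ is bounded away from zero because $c \le |\mathrm{E}[w_i d_i z_{0i}]| \le \{\mathrm{E}[w_i d_i^2]\,\mathrm{E}[w_i z_{0i}^2]\}^{1/2}$ and 
$\mathrm{E}[w_i z_{0i}^2]$ is bounded above uniformly. The result then follows since $\sqrt{n}\,\mathbb{E}_n[\psi_{\alpha_0,h_0}(y_i,d_i,x_i)] \rightsquigarrow 
N(0, \mathrm{E}[w_i z_{0i}^2])$ and $\mathrm{E}[\psi^2_{\alpha_0,h_0}(y_i,d_i,x_i) \mid x_i, d_i] = w_i z_{0i}^2$. 

□ 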

Appendix B. Proofs of Theorems 

Proof of Theorem 1. We will verify Condition ILOG and the result follows by Lemma 1. The 
assumptions on the conditional variance $w_i$ and the moment conditions on $d_i$ and $v_i$ in Condition 
L imply Condition ILOG(i). 

We let $K_x = \max_{i\le n}\|x_i\|_\infty$ and $K_d = \max_{i\le n}|d_i|$ so that $K_x \vee K_d \le K_n$. Under the 
assumption on the weights stated in Condition L(ii) and the sparse eigenvalue bounds stated in 
Condition L(iii) we have that $\kappa_{\bar c}$ is bounded away from zero for $n$ sufficiently large, see [8]. 

Step 1 relies on Post-Lasso-Logistic. To apply Lemma [5] to obtain rates and sparsity bounds 
we first verify the side condition q^^ > 3(1 H — )X^/s/{nKc). Without loss of generality assume 
that T contains the treatment d in its support. Thus for Xi = {di,x[)', 5 = ((5(^,5^)' we have 

™^^eAc E„[|x^5p] ^ ini^eAc 4E„[|x^<5,p]+4E„[|d,5dP] ^ I'^^-'eAc 4K.||5,||iE„[|xJ5,p]+4|5d|^E„[|d,P] 

^. r ll^^^lli.nPT||«c 

^ ""-^e^c 4i4',p,||i{p'^5||2,„ + ||<5drfd|2,nF+4|<5d|^E„[|d,P]||5T||l 

^ ■ r ll^i'^lli.nPTljlKc/v^ 

^ ^-^eAc 8K41+c)||5T||ip;5||^„+8K4l+c)PTl|i|'5dP{|Mdll„+]En[M»P]} 
^ 8X4l+c){l+p,||2„/Kg+E„[|d,|3]/K2} ~-P ^sK^ 

by E[(i^] ^ C and k^ bounded away from zero by Condition L. Also by Condition L we have 
minj^„ ujj > c > 0. Therefore, since A < ^Jn\og{pyri) we have 

nKc . , Wy/wlx'^SWl 



n 



r II V ^ i II z,Ai ^^ V 

inf ^;;-7 r-^TTTTT ^P T7 ^ . ,, n ->P OO 



A^+ v^snlog(pVn) -JeAc E„[i(;j|x^5|3] ^ K^slogipVn) 
under K^s'^ log {pV n) ^ (5„n. 

To apply Lemma [3] we need to verify the side condition 



QA^+s/^ > Vs -h s||VA(r7o)||oo/\/(/'mm(S'+ s), 

where s <p s by LemmaO Similarly to the previous argument, for Xi = {di,x[y and 6 = {6d, S'^)' , 

ll^ll^l'J+Cs^^^PI^ ^ P||o"+C.4Enll<<5.PJ+4|5.PE4|rf.PI 

> :„f {0min(^+C«)}^/"l|g|P 

^ ||5||o"s+Cs4X.||5.||i</,^ax(s+Cs)||5..|P+4||5||3E„l|d,p| 

> {</-min(^ + C«)}"/^ > 1 
^ 4K^75+Ci?imax(s + Cs)+4E„[|di|3] ~^ A-^v^ 

by Condition L(iii) and E„[|(ij|^] < E[|(ij|^] ^ C with probability 1 — o(l). Therefore, by 
K^s^ log^(p V n) ^ (5„n and A < Y^nlog(p V n) we have 






Therefore |5 — ao| ^P Y^slog(p V n)/n so that ^ = {a : |a — a| ^ Clog^""^ n} ^ {a : |a — ao| ^ 
n~^''^/6n} under slog{p V n) log n ^ (5.„n which is required in ILOG(i). This also ensures the 
initial rate required for a in ILOG(iii) since a £ A. 

Step 2 relies on Post-Lasso with estimated weights. Condition WL(i) and (ii) is assumed by 
Conditions L. Condition WL(iii) follows from Lemma [J] applied twice with Q = vt and Qi = di 
under the condition that K^logp ^ 5„n. We will set A = n^''^. The first part of Condition 
WL(iv), since G is Lipschitz, follows from Wwi — Wi||2,n ^ W^iW ~ M + di{ao — ao)||2,n ^p 
yslogpjn < {6n/Kx){K^^ A \/^/sn}. The second part of Condition WL(ii), note that 

'Wi -Wi 



max 



E. 



XijVi 



< 



— II O II / ^11 < /Slogp _i/3 

llJj 2,n2max XijVi/i/u^ 2,n ^P \ ^P On?^ ' 

js;p V n 



so that pn = n ^'^ by Lemma[3]with (^j = vij ^Jwl under K'^ logp ^ 5„n and using s^ log (jNn) ^ 
(5„n. Therefore, by LemmalHwe have ||a;'j(0— 0o)||2,ra ^p n~^/'^y^and ||0||o ^ Cs with probability 
1-A„. 

The choice of instrument is zoi = '^i/ ^/wl = di — x[9q and % = di — x[9 so that 

2i - ^0* = xH^o - ^} (B.23) 

The rates established above for (a, (3,9) imply ()A.9P in ILOG(ii) since by Condition L(ii) 



{1 + maxi^„ \x',i9 - 9o)\^/^}\\x',i(3 - (3o)h,n <p 1 + VK^sw^^ 



slogp 



< 



n 



-1/4 






|5-ao| ||x-(6i-6'o)||2,n ^^nn'^l'^ 
Moreover, Condition ILOG(iv) holds since 



illillx^^ - 0o)||2,n <p KxSn-i/3^n-i/3 
and 11(1 V \d,\){% - zo.)l|2,n ^ (1 + \\d\tll)\\{x\(9 - 9^)YtIl 



{x\(9 - 9^)f\\2,n ^K^\\9 

K(^-/3o)}2||2,n =0P(1) 



X2s2 



n 7ii/3 



Op(l) 



o(l). 



Next we verify Condition ILOG(iii). Let (pi^a) = yi — G{x'-(3+dia), (pi{a) = yi — G{x'^/3o+dia). 
Note that 

sup |(E„ - E) [ipi{a)zi - (pi{a)zoi]\ < sup |(E„ - E) [{^i{a) - (pi{a)}{zi - zo.,)]| + (B.24) 

aeA aeA 



+ sup |(E„ - E) [ipi{a){zi - ZQi)]\ + 

aeA 

+ sup |(E„ - E) [{^i{a) - lp^{a)}zoi 

aeA 



(B.25) 
(B.26) 



To bound ()B.24p . since \^i{a) — (pi{a)\ ^ \x[{/3 — /3o)|, {wil ^ 1, we use Cauchy-Schwartz to 
obtain 

(1^:241) s; \\x'^0 - Po)\\2,n{2\Kie - 9o)\\2,n} <p 6„n-'/\ 
To bound ()B.25p we consider 



^EM ^snp^eA 



(E„ - E) 



{ip,{a) - ip,{ao)}x[{9o - e) + (E„ - E) ip,{ao)x'^{9o - 9) 



<, 



sup 



aeA,\\S\\i^Csn~^/^ 



\{En-E)[{ip,{a)-ip,{ao)}x'M 



sup 

||(5||i^Csn-i/3 



|(E„ -E) [ip^{aQ)x[S] 
Using Lemma [5] twice (one application of the lemma has Wij = diXij, T = {{a — ao)6 £ W : 
a £ A,\\6\\i ^ Csn~^'^} and ^j = 1, the other application is standard) we have 



since by LemmalU maxjsg.„E[(i^ | Xj] ^ C and maxjs;pE„[x?] ^ 1 we have 



maxE^fdfx?.] ^ max(E„ - E)[d^xl] + CmaxE„[x?.] < .n2^K^JEn[dj] + C <p C 



under K^logp ^ 5„n and E[(i|] ^ C. 

Next we proceed to bound ()B.26p . We will consider the class of functions J-c,r which pertains 
to {(pi[a) — (/9j(a)}zoj> namely for some C suitably large 

F = {G{x'i^ + dia)zoi - G{x% + dia)zoi : ||/3||o ^ Cs, ||xH/3 - /3o}||2,n ^ C^/slogp/n} 

which is the union of (^g) classes of functions each with VC index bounded by Cs. Thus, 
logiV(e||F||2,p„, J^, Pn) < slogp + slog(l/e). By Lemma [6] we have 

1/2 



slog(pVn) / - 21 , /slog(nVp) 



sup |E„[/]| <p V^ '- supE[/^] + W^ ^ /supE„[/4] vE[/4] 

Note that snpf^^En[f] V E[f^] ^ E„[4] V E[z^^] <p E[vf/wf] = 0(1), and 
sup^,^E[/2] ^sup E[(x^<5)2z2.] 

^ s^P|i /All < f^T^^ri[{Xi6)^]maxi^nE[zl \ Xi] 
<^max,^„E[t;2M|x.]<^. 

Combining these relations we have 

1/2 



(1^:26]) <p . /iM^M 'l}^i(PXA + /^log(pVn) , ^ ^^^^^_,/, 

provided s^ log {pV n) ^ 5„n. 

The last condition to be verified is the second condition in ILOG(iii). We will show that 
E„[^j(q)zj] changes sign over a £ A with high probability which by continuity of (^i(-) implies 
that E„[(^j(d)zj] =p 0. Note that for any a £ A 

(1) (2) 
' " ^ '^ ""^ ^ 

En[ipi{a)zi] = (E„ - E)[ipi{a)zi - ipi{a)zoi]+E[lpi{a)zi] - E[ipi{a)zoi] + 
+ (E„ - E)[^pi{a)zoi] +E[ipi{a)zoi]. 

^ V ' 

(3) 
Note that by the first part of ILOG(iii) established before we have (1) <p dn n~^". By the 
expansion (|A.14p we have (2) ^ 5„n~^/2 + (5„|q- ao| from (jA.lSp . ()A.16P and ()A.17p . Moreover, 
we have by Lemma [5] and E[d?2;oi] = 0{1) 

(3) ^ sup |(E„ - E)[{^,{a) - ipi{ao)}zoi]\ + |(E„ - E)[^i{ao)zoi]\ <p bnrT^I'' ^ tT^I'' . 

Therefore, since E[c^j(a)zoi] = (a — ao)E[t)?] + 0{\a — aop) we have 
E„[(/?i(Q)zoj] = Opirr^l'^ + 5„|a - qqI) + E[93i(a)zoi] 



Op(n-i/2) + (a _ ao){E[t;2] + Op(5„)} + 0(|a - ao| 



2^ 



(B.27) 



Since E[u?] ^ c and (5„ — >• 0, when we evaluate ()B.27p on the extreme points a'^, /c = 1, 2, of ^ 
we obtain a positive value for one extreme and a negative value for the other extreme for n large 
enough since ja*"' — ao| = Clog" n. D 

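Proof of Theorem 2. Let $\hat T^* = \mathrm{support}(\hat\theta) \cup \mathrm{support}(\tilde\beta)$. By the first order condition we have

\[ \mathbb{E}_n[\{y_i - G(\check\alpha d_i + x_i'\check\beta)\}(d_i, x_{i\hat T^*}')'] = 0. \tag{B.28} \]

Next we will construct a suitable instrument to apply Lemma 1. Define

\[ \hat\theta^* \in \arg\min_{\theta} \|x_i'(\theta - \theta_0)\|_{2,n} : \mathrm{support}(\theta) \subseteq \hat T^*. \]

We use the optimal instrument $z_{0i} = v_i/\sqrt{w_i} = d_i - x_i'\theta_0$ and the estimated instrument $\hat z_i = 
d_i - x_i'\hat\theta^*$. Note that by (B.28), taking the linear combination $(1; -\hat\theta^*)$ of the optimality condition 
we have

\[ \mathbb{E}_n[\{y_i - G(\check\alpha d_i + x_i'\check\beta)\}\hat z_i] = 0. \]

Therefore $\check\alpha$ minimizes the criterion

\[ L_n(\alpha) = \frac{|\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\check\beta)\}\hat z_i]|^2}{\mathbb{E}_n[\{y_i - G(d_i\alpha + x_i'\check\beta)\}^2 \hat z_i^2]}, \]

induced by $\{(x_i'\check\beta, \hat z_i) : i = 1, \ldots, n\}$, over $\alpha \in \mathbb{R}$. 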

Regarding Steps 1 and 2, rates of convergence for Lasso-Logistics, Post-Lasso-Logistics, Lasso 
with estimated weights and the associated sparsity bounds are established as in the proof of 
Theorem [TJ Thus we have \\9\\o <p s, ||/3||o ^p s, A(S, /?) — A(qo,/3o) ^p slogp/n and \\x[{9 — 
Oo)\\2,n <p V^n-^/\ 

Next we analyze Step 3. The sparsity results above implies that T* = support(0)Usupport(/3) 
satisfies |r*| <p s. Moreover, since support(/3) C T* we have 

A(a,/3)-A(ao,/3o) ^ A(a,^) - A(ao,/3o) <p slogp/n. 
Thus, by the condition on the sparse eigenvalue requirement in Condition L, Lemma 3 
establishes a rate of convergence for the post-model selection Logistic regression estimator ||a;^(/3 — 
2,n <p yJs\ogp/n, \a - ao\ <p ^/slogp/n, and ||/3 - /3o||i <p Vs\\$ - /3o||2 < \/s||x-(/3 - 
\2,n/{(f>mm{C's)} <p Sy^logp/n. Moreover, since support(0) C T* we have ||x-(6'*-6'o)||2,n ^ 

lki(^-^o)||2,„ <p V^n-y^and ||^*-0ol|i <P V^\\0*-Oo\\2 < V^\Kie*-eo)\\2,n/{<PminiC's)} <. 

sn~ 



-1/3 



The remaining assumptions in Condition ILOG can be verified as in the proof of Theorem [TJ D 

Proof of Theorem The proof proceeds similarly to the proof of Theorem [T] with modifications 
on the rates achieved in Step 2 for 11x^(6* — 6*0)112,™ and one argument to establish ILOG(iii), 
namely the application of Lemma [6] to establish 



sup|E„[/]|<pi 

/6^ 


/slogpVn / -^^ 
V n \^/e^ 


and bounding 




sup^e^E„[/4]vE[/4] 


< —. — - — T + max,- 

V million w^ 



+ max,^nnvf/wf]) sup5e^E„[|x',5r] < KtR ^ T ''T 



where 5 (z F means ||5||o ^ s and ||x^(5||2,n ^ ^J s\ogp/n. Combining this relation with the 
original bound on supjgjrE[/^] we have 



1/2 



(1^^261) < / ^log(pV^) [ ■slog(pV"-) J / i^gg^logpi^g5log(pVn) I < 0(1)^-1/2 
'^Vnln Vn n / "^ 

provided (i^^^ V Kv)s'^ log^(p V n) ^ (5„n. 

Next we turn to Step 2. The verifications of Conditions WL(i), (ii) and (iii) remain unchanged. 
We will set the standard penalty level A ~ \/nlogp. The first part of Condition WL(iv) follows 



similarly, \\wi-Wi\\2,n ^ ||x-(/3-/3o) + di(ao -ao)||2,n <p ^/slogp/n < {5n/Kr,){K^ ^ AX/^/sn} 
under {K^ + K^)s^log p ^ 6nn. The second part of Condition WL(ii), note that \wi — Wi\ ^ 
\x[{/3 — /3q)\ we have by Lemma[5]and E[ "''^^' Xtji'i] = we have 



max 



E. 



Wi -Wi 

Xi j Vi 
Wi 



<,J-f]^J'^K^<S^n-y^ 



n \ n 



using i^^s^log^(p V n) ^ (5„n. Therefore, by LemmalHwe have \\x[{0 — ^o)||2,n ^p yj s\ogp/n 
and ll^llo ^ Cs with probability 1 — A^. D 
Appendix C. Auxiliary Results for Penalized and Post-Model Selection Estimators 

In this section we state relevant theoretical results on the performance of the $\ell_1$-penalized 
Logistic regression estimators, heteroscedastic Lasso with estimated weights estimators and the 
associated post-model selection estimators. The analysis of the latter builds upon the analysis 
of Lasso under heteroscedasticity of [2] and it was developed in [7]. The analysis of the former 
builds upon the work of [1] that established rates for $\ell_1$-penalized Logistic regression exploiting 
self-concordance. The main design condition relies on the restricted eigenvalue proposed in [8], 
namely for $\tilde x_i = (d_i, x_i')'$

\[ \kappa_{\bar c} = \inf_{\|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1} \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}/\|\delta_T\|, \tag{C.29} \]

where $\bar c = (c+1)/(c-1)$ for the slack constant $c > 1$. In the original setting of [8] for least 
squares we have $w_i = 1$, and it is well known that, when $\bar c$ is bounded, $\kappa_{\bar c}$ is bounded away from zero 
for any subset $T \subset \{1, \ldots, p\}$ with $|T| \le s$, provided the sparse eigenvalues of order $Cs$ are well behaved 
(bounded away from zero and from above uniformly) for a suitably large constant $C$. 

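C.1. Results for Lasso and Post Lasso with Estimated Weights. In this section we state 
results obtained in [7] for Post-Lasso estimators with estimated weights, namely the model

\[ \sqrt{w_i}\,d_i = \sqrt{w_i}\,x_i'\theta_0 + v_i, \quad \mathrm{E}[\sqrt{w_i}\,v_i \mid x_i] = 0, \tag{C.30} \]

where we observe $\{(d_i, x_i) : i = 1, \ldots, n\}$, i.n.i.d., and only an estimate $\hat w_i$ of the conditional 
variance function $w_i = G_i(1 - G_i)$. The support $T_{\theta_0} = \mathrm{support}(\theta_0)$ is unknown but a sparsity 
condition holds, namely $|T_{\theta_0}| \le s$. Estimators for $\theta_0$ and $v_i$ can be computed based on Lasso or 
Post-Lasso, namely

\[ \hat\theta \in \arg\min_{\theta \in \mathbb{R}^p} \mathbb{E}_n[\hat w_i(d_i - x_i'\theta)^2] + \frac{\lambda}{n}\|\hat\Gamma\theta\|_1 \ \text{ and set } \ \hat v_i = \sqrt{\hat w_i}\,(d_i - x_i'\hat\theta), \ i = 1, \ldots, n, \tag{C.31} \]

\[ \tilde\theta \in \arg\min_{\theta} \big\{\mathbb{E}_n[\hat w_i(d_i - x_i'\theta)^2] : \theta_j = 0 \text{ if } \hat\theta_j = 0\big\}, \ \text{ set } \ \tilde v_i = \sqrt{\hat w_i}\,(d_i - x_i'\tilde\theta), \tag{C.32} \]

where $\lambda$ and $\hat\Gamma$ are the associated penalty level and loadings specified below. 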



However, under the inherent heteroskedasticity, the estimator $\hat w_i$ of $w_i$ typically depends on 
$d_i$, which can affect the proper choice of $\lambda$, leading to slower rates of convergence. (If the 
heteroskedasticity is only with respect to $x_i$, standard rates of convergence can be derived if the estimated 
weights do not depend on $d_i$.) The following 
are sufficient high-level conditions where again the sequences $\Delta_n$ and $\delta_n$ go to zero and $C$ is a 
constant independent of $n$, and there are sequences $K_x$ and $K_v$ such that $\max_{i\le n}\|x_i\|_\infty \le K_x$ 
and $\max_{i\le n}|v_i| \le K_v$ with probability at least $1 - \Delta_n$. 
Condition WL. For the model (C.30), normalize $\mathbb{E}_n[x_{ij}^2] = 1$, $j = 1, \ldots, p$, and suppose that 
(i) $\|\theta_0\|_0 \le s$ where $s \ge 1$, and the weights satisfy $0 < c \le w_i \le 1$ uniformly in $n$, 

(«) nvf] > e > 0, „,^ M^lp , C. ,-Hi - .m < Kn^'^ 

(iii) max|(En — E)[t(;jX?ff]| ^ 5„, with probability 1 — A„ 

(iv) we assume that the estimates $\hat w_i$, $i = 1, \ldots, n$, satisfy with probability $1 - \Delta_n$ 

'Wi -Wi 



Wi - ^Jwi\\2,n ^ T^ T^ /\ —j=—j= and 



K^, \K„ JsJn 



E. 



XiVi 



^ SnPn 



Condition WL(i) is a standard sparsity assumption and could be relaxed in different directions. 
Condition WL(ii) is common in Logistic regression models even with fixed dimensions. 
Condition WL(iv) requires high-level rates of convergence for the estimate $\hat w_i$ and a bound $\rho_n$ for a near 
orthogonality condition. These estimates can be constructed with the $\ell_1$-Logistic method studied 
here. Condition WL(iv) is trivially satisfied if the true density is used. Finally, several 
primitive moment conditions imply the various cross moments bounds in Condition WL(iii). This 
condition is used to apply self-normalized moderate deviation theory to control heteroskedastic 
non-Gaussian errors similarly to [2] where there are no estimated weights. 

The penalty choice potentially depends on $\rho_n$ to account for the impact of the estimated weights 
on the orthogonality condition. This additional term can dominate the penalty level. Next we 
present results on the performance of the estimators generated by Lasso with estimated weights. 
Following [2] we call asymptotically valid any penalty loadings $\hat\Gamma$ that obey a.s.

\[ \ell\hat\Gamma_0 \le \hat\Gamma \le u\hat\Gamma_0, \tag{C.33} \]

with $0 < \ell \le 1 \le u$ such that $\ell \to_P 1$ and $u \to_P u'$ with $u' \ge 1$. Asymptotically valid options for 
setting the penalty loadings for $j = 1, \ldots, p$ are

\[ \text{initial: } \hat\gamma_j = \sqrt{\mathbb{E}_n[\hat w_i x_{ij}^2 (d_i - \bar d)^2]}, \qquad \text{refined: } \hat\gamma_j = \sqrt{\mathbb{E}_n[\hat w_i x_{ij}^2 \hat v_i^2]}, \tag{C.34} \]

where $\bar d := \mathbb{E}_n[d_i]$ and $\hat v_i$ is an estimate of $v_i$ based on Lasso with the initial option (or iterations). 
[2] established the validity of using either of the choices in (C.34). Next we present results on 
the performance of the estimators generated by Lasso and Post-Lasso with estimated weights. 

Theorem 4 (Properties of Lasso and Post-Lasso with estimated Weights). Under Condition 
WL and setting A ^ 6nPn + 2c'^/n^~^{l — j/2p) for c' > c > 1, 7 — t- 0, and using an asymptotic 
valid penalty loading F, for c = ||ro||oo||rQ ||oo(^c + l)/(ic — 1) we have for n large enough 

< A^s 



w}ix'i(6i-6'o)||2,„ -p . /^^^ 

Moreover, under Condition L(iii), the data- dependent model Tq^ selected by a Lasso estimator 
satisfies with probability 1 — A„; 

\mo = \T9o\<s (C.35) 

Finally, the Post-Lasso estimator obeys 



\x'i{9 - eo)\\2,n <P SnPnV^ + \ ^^ + ^^ and 

n nuc 



<n n u ^ e- / S^ log (^ V Tl) Xs 

\0 - 9o\\l <P 6nPnS + \ ^^ - + 



n UKc 

Theorem 4 above establishes the rate of convergence for Lasso and Post-Lasso with estimated 
weights. This leads to bounds on the error between the estimated instrument $\hat z_i$ 
used in Table 1 and the associated valid instrument $z_{0i} = v_i/\sqrt{w_i}$, since

\[ \hat z_i - z_{0i} = d_i - x_i'\hat\theta - \frac{v_i}{\sqrt{w_i}} = d_i - x_i'\hat\theta - \{d_i - x_i'\theta_0\} = x_i'(\theta_0 - \hat\theta). \tag{C.36} \]

Sparsity properties of the Lasso estimator $\hat\theta$ under estimated weights follow similarly to the 
standard Lasso analysis derived in [2]. By combining such sparsity properties and the rates in 
the prediction norm we can establish rates for the post-model selection estimator under estimated 
weights. 

Comment C.l (Penalty Choice in Step 2). In Step 2 of the proposed method in Table [T] we 
have pn = n^^/'^ so that setting A2 = n^'^ we have 2c!^/n(^~^(l — 7/2p)||r||oo ^ 5n^2- In this 
case the penalty is dominated by the potential bias in the estimation of the weights. Thus 
setting the loadings to 1 with A2 = n?'^ is asymptotic valid. 

C.2. $\ell_1$-Penalized Logistic Regression. Consider a data generating process such that

\[ \mathrm{E}[y_i \mid x_i] = G(x_i'\eta_0), \]

which is independent across $i$ ($i = 1, \ldots, n$). Without loss of generality, we assume that $\|\eta_0\|_0 = 
s \ge 1$ and $\mathbb{E}_n[x_{ij}^2] = 1$ for all $1 \le j \le p$. First we consider the estimation of $\eta_0$ via $\ell_1$-penalized 
Logistic regression

\[ \hat\eta \in \arg\min_{\eta} \Lambda(\eta) + \frac{\lambda}{n}\|\eta\|_1. \tag{C.37} \]

n n 
Following a general principle used in $\ell_1$-penalized estimators as discussed in [51 [H [3l [HI [22] , 
under the event that

\[ \frac{\lambda}{n} \ge c\|\nabla\Lambda(\eta_0)\|_\infty = c\|\mathbb{E}_n[\{y_i - G(x_i'\eta_0)\}x_i]\|_\infty, \quad \text{where } c > 1, \tag{C.38} \]

the estimator in (C.37) achieves good theoretical guarantees under mild design conditions. 
Although $\eta_0$ is unknown, we can set $\lambda$ so that the event in (C.38) holds with high probability. In 
particular, Remark IE . 1 1 based on Lemma [T2] shows that it suffices to set A = -2--v/ra<&~^(l— 7/[2p]) 
where we suggest $\gamma = 0.1/\log n$. Next we present results for the estimator (C.37). 
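For illustration, a penalty level of this type can be computed with the standard library alone. The sketch below assumes the form $\lambda = (c/2)\sqrt{n}\,\Phi^{-1}(1 - \gamma/[2p])$ with $\gamma = 0.1/\log n$; the factor $c/2$ and the default `c = 1.1` are assumptions of this example (the leading constant is not legible in the formula above), and `penalty_level` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def penalty_level(n, p, c=1.1, gamma=None):
    """Penalty lambda = (c / 2) * sqrt(n) * Phi^{-1}(1 - gamma / (2 p)).

    n: sample size, p: number of covariates.
    c: slack constant (> 1); the default 1.1 is an assumed value.
    gamma: tail probability; defaults to the suggested 0.1 / log(n).
    """
    if gamma is None:
        gamma = 0.1 / math.log(n)
    quantile = NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
    return (c / 2.0) * math.sqrt(n) * quantile
```

The resulting level grows like $\sqrt{n \log p}$, so increasing the number of covariates raises the penalty only slowly.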

Lemma 2 (Results for $\ell_1$-Penalized Logistic Regression). Assume $\lambda/n \ge c\|\nabla\Lambda(\eta_0)\|_\infty$, $c > 1$, 
and let $\bar c = (c+1)/(c-1)$. Then 

, ,^ ,,, , 1 , A-v/i , ,,^ ,, (l + c)(l + c) As 

Wix'S-r]o)\\2,n^S{l + ^J-^ and ||r? - 7?o||i s^ 3^ ^J ^— 2 



HKc C UKi. 



2 



provided that inf^pA „ , i-/r%/9'ir» > 3(1 H — )— ^. Moreover, we have 

|support(^)| ^ 36,^2 min^g^0^ax(m) ^^^ a(^) - A(7?o) ^ 3(1 + i) f ^"j 
where M = {m G N : m > 72c^s0max("i-)//^c}- 

The extra growth condition required for identification is mild. For instance we typically have 



A < Y^log(n Vp)/n and, if the weights wi are bounded away from zero, for many designs of 
interest we have inf^gAc ll^i'^lli n/^riil^i"^!^] bounded away from zero (see p]). For more general 
designs and weights we have 

mr — -- — ; — —rkr > mt — - — - — — — — > 



SeA^ En[wi\x'-6\''^] <5eAc maxj^n ||xi||oo||(^||i ^(1 + c) maxj^„ ||xj||oo 

which implies the extra growth condition under K^s'^log{p V n) ^ Snu'^n. Under the condition 

1/2 
that s/Sn-spaise eigenvalues are bounded away from zero and from above, it follows that s/6n 

belongs to M for n large enough so that |support(r7)| < s under the conditions above. 

In order to alleviate the bias introduced by the $\ell_1$-penalty, we can consider the associated 
post-model selection estimates. Let $\hat T^*$ denote a subset of covariates (selected arbitrarily) and 
define the associated post-model selection estimator

\[ \tilde\eta \in \arg\min_{\eta} \big\{\Lambda(\eta) : \eta_j = 0 \text{ if } j \notin \hat T^*\big\}. \tag{C.39} \]

Typically $\hat T^*$ can be taken as $\mathrm{support}(\hat\eta)$. However, we can add additional variables through 
other procedures. (For example, in Step 1 we always include the treatment $d_i$; in Step 3 of 
the double selection procedure covariates selected in a different equation are included.) The 
following result characterizes the performance of the estimator in (C.39). 

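Lemma 3 (Estimation Error of Post-$\ell_1$-penalized Logistic Regression). Let $s^* = |\hat T^*|$. We have

\[ \|\sqrt{w_i}\,\tilde x_i'(\tilde\eta-\eta_0)\|_{2,n} \le \frac{3\sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty}{\sqrt{\phi_{\min}(s^*+s)}} + \sqrt{3}\,\sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)} \]

provided that

\[ \inf_{\|\delta\|_0\le s^*+s} \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^3}{\mathbb{E}_n[w_i|\tilde x_i'\delta|^3]} > 6\max\left\{ \frac{\sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty}{\sqrt{\phi_{\min}(s^*+s)}},\ \sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)} \right\}. \]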



Lemma 3 provides the rate of convergence in the prediction norm for the post-model selection estimator despite possibly imperfect model selection. The rates rely on the overall quality of the selected model and the overall number of components $s^*$. Once again, based on the results in Lemma 2, the extra growth condition required for identification is mild provided that $\mathrm{support}(\hat\eta) \subseteq \hat T^*$ and $s^*$ is not much larger than $s$.

Comment C.2. In Step 1 of the algorithms we use $\ell_1$-penalized logistic regression with $\tilde x_i = (d_i, x_i')'$, $\delta := \hat\eta - \eta_0 = (\hat\alpha - \alpha_0, \hat\beta' - \beta_0')'$, and we are interested in rates for $\|x_i'(\hat\beta-\beta_0)\|_{2,n}$ instead of $\|\tilde x_i'\delta\|_{2,n}$. However, it follows that

\[ \|x_i'(\hat\beta-\beta_0)\|_{2,n} \le \|\tilde x_i'\delta\|_{2,n} + |\hat\alpha-\alpha_0|\cdot\|d_i\|_{2,n}. \]

Since $s \ge 1$, without loss of generality we can assume the component associated with the treatment $d_i$ belongs to $T$ (at the cost of increasing the cardinality of $T$ by one, which will not affect the rate of convergence). Therefore we have that

\[ |\hat\alpha-\alpha_0| \le \|\delta_T\| \le \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}/\kappa_{\bar c}. \]

In most applications of interest $\|d_i\|_{2,n}$ and $1/\kappa_{\bar c}$ are bounded from above with high probability. Similarly, in Step 1 of Algorithm 1 we have that the Post-$\ell_1$-LAD estimator satisfies

\[ \|x_i'(\tilde\beta-\beta_0)\|_{2,n} \le \|\tilde x_i'\tilde\delta\|_{2,n}\Big(1 + \|d_i\|_{2,n}/\sqrt{\phi_{\min}(\hat s+s)}\Big). \]

Appendix D. Auxiliary Inequalities 

Lemma 4. Fix arbitrary vectors $x_1,\ldots,x_n \in \mathbb{R}^p$ with $\max_{i\le n}\|x_i\|_\infty \le K_x$. Let $\zeta_i$ $(i = 1,\ldots,n)$ be independent random variables such that $E[|\zeta_i|^q] < \infty$ for some $q \ge 4$. Then we have with probability $1-8\tau$

\[ \max_{1\le j\le p} |(\mathbb{E}_n - E)[x_{ij}^2\zeta_i^2]| \le 4\sqrt{\frac{\log(2p/\tau)}{n}}\,K_x^2\,\big(E[|\zeta_i|^q]/\tau\big)^{4/q}. \]

Proof. The result is derived in Lemma 2 of [5], which follows from a maximal inequality derived in [6]. □

Consider an empirical process $G_n(f) = n^{-1/2}\sum_{i=1}^n \{f(Z_i) - E[f(Z_i)]\}$ indexed by $\mathcal{F}$, a class
of pointwise measurable functions (see [20] Chapter 2.3) and assume that G J-". The random 
empirical measure for an underlying independent data sequence {Zj, i = 1, . . . , n} is denoted by 



Lemma 5. Let \hi{t)\ ^ \i!Wi\, Kj^ > cP' := supt(,-T-E[hi{tf^f], and \\T\\i = suptg^- ||t||i. We 
have 



E 



sup\{En-E)[hi{t)^i 



^ ^T\\{£.mn[eiWiii]\\oo] and 



p (supi(E„-E)[/i,(t)g,]i > ^\\^\\}y^^ \ ^ 

32dim(H/.)»p (^) + P {^^SS«,^MWm\ > m) ■ 
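Proof. To establish the first relation, by symmetrization for expectations (Lemma 6.3 in [12]),

\[ E\Big[\sup_{t\in\mathcal{T}} |(\mathbb{E}_n - E)[h_i(t)\xi_i]|\Big] \le 2E\Big[\sup_{t\in\mathcal{T}} |\mathbb{E}_n[\varepsilon_i h_i(t)\xi_i]|\Big], \]

and by the contraction principle (Lemma 4.12 in [12]) we have

\[ E\Big[\sup_{t\in\mathcal{T}} |(\mathbb{E}_n - E)[h_i(t)\xi_i]|\Big] \le 4E\Big[\sup_{t\in\mathcal{T}} |\mathbb{E}_n[\varepsilon_i t'W_i\xi_i]|\Big] \le 4\sup_{t\in\mathcal{T}}\|t\|_1\,E\big[\|\mathbb{E}_n[\varepsilon_i W_i\xi_i]\|_\infty\big]. \]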



By Lemma 2.3.7 in [21], symmetrization for probabilities, we have 



P [snv\Gn{eMt)ii)\ > K/A 



P sup|G„(/ii(t)^i)| > K] ^ 732Tr^2^- I ""^ 

since var(G„(/ii(t)^i)) ^ E[/ij(t)^^j^] ^ a"^. Conditional on {C, Wj} we have 

E[exp(?/'suptg7-|G„(ei/ii(t)Ci)l)] ^ E[exp(4V' sup^g^- ||t||i||G„(eiTyiCi)||oo)] 

^ dim(Wj) • max E[exp{4V'suptg7- ||t||i|G„(eiWjjCj)|}] 

j^dim(Wi) 

^ 2dim(Wi) • exp(8V'^ sup \\t\\l max ¥.n[WM]) 

teT jiiAira(Wi) 

Since we have that P{X > K) ^ min^j.oexp(— ^ivr)E[exp('(/'X)], by choosing the parameter -0 

as V = i^/{16||r||?maXj-^dim(VP'.)IE„[H^2/,^]} it follows 

Pe (snv\Gn{eihi{t)ii)\ > K \ h.^WuiA ^ 8dim(Ty,)exp(-KV{16sup \\t\\l , max W.n[WM]]) 



teT 



teT j^divcL{Wi) 
The result follows by taking the expectation conditioned on $\{\max_{j\le \dim(W_i)} \mathbb{E}_n[W_{ij}^2\xi_i^2] \le M\}$. □

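Lemma 6. Suppose that for all $0 < \varepsilon \le \varepsilon_0$

\[ N(\varepsilon,\mathcal{F},\mathbb{P}_n) \le (\omega/\varepsilon)^m \quad\text{and}\quad N(\varepsilon,\mathcal{F}^2,\mathbb{P}_n) \le (\omega/\varepsilon)^m, \tag{D.40} \]

for some $\omega$ which can grow with $n$. Then, as $n$ grows we have

\[ \sup_{f\in\mathcal{F}} |G_n(f)| \lesssim_P \sqrt{m\log(\omega\vee n)}\left( \sup_{f\in\mathcal{F}} E[f^2] + \sqrt{\frac{m\log(\omega\vee n)}{n}}\Big(\sup_{f\in\mathcal{F}} \mathbb{E}_n[f^4]\vee E[f^4]\Big)^{1/2} \right)^{1/2}. \]

Proof. The result is derived in [4]. □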
Supplementary Appendix for "Honest Confidence Regions for Logistic Regression with a Large Number of Controls"

Appendix E. Technical Results and Proofs for Logistic Regression 

In this section our goal is to establish sparsity and rates of convergence of the Post-Lasso Logistic estimator. Both of these properties require us to also revisit the analysis of the $\ell_1$-penalized logistic regression (Lasso-Logistic) estimator. In what follows we use a more compact notation, specifically $\eta = (\alpha,\beta')'$, $\tilde x_i = (d_i, x_i')'$, $\eta_0 = (\alpha_0,\beta_0')'$. Thus the Lasso-Logistic estimator is defined as any vector $\hat\eta$ such that

\[ \hat\eta \in \arg\min_{\eta}\ \Lambda(\eta) + \frac{\lambda}{n}\|\eta\|_1. \tag{E.41} \]

We will also consider the post-model selection Logistic estimator associated with a support $\hat T^* \subseteq \{1,\ldots,p\}$, defined as

\[ \tilde\eta \in \arg\min_{\eta}\ \Lambda(\eta) : \mathrm{support}(\eta) \subseteq \hat T^*. \tag{E.42} \]
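The two estimators in (E.41) and (E.42) can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain proximal-gradient (soft-thresholding) iteration, which the text does not prescribe, synthetic Gaussian data, and hypothetical names (`fit`, `loss`, `gradient`). The criterion is taken to be $\Lambda(\eta) = \mathbb{E}_n[\log\{1+\exp(\tilde x_i'\eta)\} - y_i\tilde x_i'\eta]$, and the constant `0.55` corresponds to the illustrative choice $c = 1.1$ in $\lambda = \frac{c}{2}\sqrt{n}\,\Phi^{-1}(1-\gamma/[2p])$.

```python
import math
import random
from statistics import NormalDist


def sigmoid(z):
    """Numerically stable logistic link G(z) = exp(z) / (1 + exp(z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loss(X, y, eta):
    """Lambda(eta) = E_n[log(1 + exp(x_i'eta)) - y_i x_i'eta]."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(a * b for a, b in zip(xi, eta))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - yi * z
    return total / len(X)


def gradient(X, y, eta):
    """Gradient of Lambda: E_n[(G(x_i'eta) - y_i) x_i]."""
    n, p = len(X), len(eta)
    g = [0.0] * p
    for xi, yi in zip(X, y):
        r = sigmoid(sum(a * b for a, b in zip(xi, eta))) - yi
        for j in range(p):
            g[j] += r * xi[j] / n
    return g


def fit(X, y, lam, free=None, iters=500):
    """Proximal gradient for Lambda(eta) + (lam/n)||eta||_1.

    Coordinates outside `free` are held at zero, so lam = 0 with
    free = selected support gives a post-selection refit as in (E.42).
    """
    n, p = len(X), len(X[0])
    free = range(p) if free is None else sorted(free)
    # Step 1/L, where L = max_i ||x_i||^2 / 4 bounds the Hessian of Lambda.
    step = 4.0 / max(sum(a * a for a in xi) for xi in X)
    thr = step * lam / n
    eta = [0.0] * p
    for _ in range(iters):
        g = gradient(X, y, eta)
        for j in free:
            z = eta[j] - step * g[j]
            eta[j] = math.copysign(max(abs(z) - thr, 0.0), z)  # soft-threshold
    return eta


# Synthetic example (all values hypothetical).
random.seed(0)
n, p = 200, 10
eta0 = [2.0, -2.0] + [0.0] * (p - 2)
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1.0 if random.random() < sigmoid(sum(a * b for a, b in zip(xi, eta0))) else 0.0
     for xi in X]

gamma = 0.1 / math.log(n)
lam = 0.55 * math.sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))

eta_hat = fit(X, y, lam)                                   # Lasso-Logistic, (E.41)
support = {j for j, v in enumerate(eta_hat) if v != 0.0}
eta_tilde = fit(X, y, 0.0, free=support)                   # Post-Lasso-Logistic, (E.42)
```

The refit drops the penalty on the selected coordinates, which is the mechanism by which the post-model selection step removes the shrinkage bias of the $\ell_1$ penalty.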

E.1. Design conditions and Relations. Next we collect relevant quantities associated with the design matrix $\mathbb{E}_n[\tilde x_i\tilde x_i']$ and the weighted counterpart $\mathbb{E}_n[w_i\tilde x_i\tilde x_i']$, where $w_i = G_i(1-G_i) \in [0,1]$, $i = 1,\ldots,n$, is the conditional variance of the outcome variable $y_i$. The non-weighted quantities are well studied in the literature (namely restricted eigenvalue, minimum and maximal sparse eigenvalues).

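Definition 1. For $T = \mathrm{support}(\eta_0)$, $|T| \ge 1$, the (logistic) restricted eigenvalue is defined as

\[ \kappa_c := \min_{\|\delta_{T^c}\|_1 \le c\|\delta_T\|_1} \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\delta_T\|}. \]

Definition 2. For a subset $A \subset \mathbb{R}^p$ let

\[ q_A = \inf_{\delta\in A} \mathbb{E}_n\big[w_i|\tilde x_i'\delta|^2\big]^{3/2} \big/ \mathbb{E}_n\big[w_i|\tilde x_i'\delta|^3\big]. \]

In this work we will apply this for $A = \Delta_c$ and $A = \{\delta\in\mathbb{R}^p : \|\delta\|_0 \le Cs\}$.

The definitions above differ from their counterparts in the analysis of $\ell_1$-penalized least squares estimators by the weighting $0 \le w_i \le 1$. Thus it will be relevant to understand their relations through the quantities

\[ \psi^{(r)}(c) := \min_{\delta\in\Delta_c} \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\tilde x_i'\delta\|_{2,n}} \quad\text{and}\quad \psi^{(s)}(m) := \min_{1\le\|\delta\|_0\le m} \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\tilde x_i'\delta\|_{2,n}}. \]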
Lemma 7 provides three relationships between the weighted versions and the non-weighted versions. None dominates the others. Most papers in the literature focus on the first pair of relations, which requires assuming that $\min_{i\le n} w_i$ is bounded away from zero uniformly in $n$. The second and third pairs of relations allow for better control in the presence of a few small weights. The second pair states that if the average harmonic mean of the weights is bounded, the ratio between the weighted and non-weighted quantities is controlled by the intrinsic sparsity.

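Lemma 7 (Relating weighted and non-weighted design quantities). Letting $w_i = G_i(1-G_i)$ we have the following inequalities: $\psi^{(r)}(c) \ge \min_{i\le n}\sqrt{w_i}$ and $\psi^{(s)}(m) \ge \min_{i\le n}\sqrt{w_i}$;

\[ \psi^{(r)}(c) \ge \frac{\kappa_c'}{\{\mathbb{E}_n[1/w_i]\}^{1/2}\sqrt{s}\,(1+c)\max_{i\le n}\|\tilde x_i\|_\infty} \quad\text{and}\quad \psi^{(s)}(m) \ge \frac{\sqrt{\phi_{\min}(m)}}{\{\mathbb{E}_n[1/w_i]\}^{1/2}\sqrt{m}\,\max_{i\le n}\|\tilde x_i\|_\infty}, \]

where $\kappa_c'$ is the original (non-weighted) restricted eigenvalue. Moreover, for any $\epsilon \in (0,1]$ we have

\[ \psi^{(s)}(m) \ge \sqrt{\epsilon}\,\Big(1 - \frac{m\max_{i\le n}\|\tilde x_i\|_\infty^2}{\phi_{\min}(m)}\,\mathbb{E}_n[1\{w_i\le\epsilon\}]\Big)^{1/2}. \]

Proof. The first pair of bounds is trivial since $w_i \ge 0$. To show the second pair we have

\[ \|\tilde x_i'\delta\|_{2,n}^2 \le \{\mathbb{E}_n[w_i|\tilde x_i'\delta|^2]\}^{1/2}\{\mathbb{E}_n[|\tilde x_i'\delta|^2/w_i]\}^{1/2} \le \{\mathbb{E}_n[w_i|\tilde x_i'\delta|^2]\}^{1/2}\{\mathbb{E}_n[1/w_i]\}^{1/2}\|\delta\|_1\max_{i\le n}\|\tilde x_i\|_\infty. \]

Therefore, for $\vartheta_\delta = \|\tilde x_i'\delta\|_{2,n}/\|\delta\|_1$ we have

\[ \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\tilde x_i'\delta\|_{2,n}} \ge \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^{1/2}\{\mathbb{E}_n[1/w_i]\}^{1/4}\|\delta\|_1^{1/2}\max_{i\le n}\|\tilde x_i\|_\infty^{1/2}} = \Big(\frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\tilde x_i'\delta\|_{2,n}}\Big)^{1/2}\frac{\vartheta_\delta^{1/2}}{\{\mathbb{E}_n[1/w_i]\}^{1/4}\max_{i\le n}\|\tilde x_i\|_\infty^{1/2}}. \]

Cancelling out $(\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}/\|\tilde x_i'\delta\|_{2,n})^{1/2}$ and squaring both sides we have

\[ \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{\|\tilde x_i'\delta\|_{2,n}} \ge \frac{\vartheta_\delta}{\{\mathbb{E}_n[1/w_i]\}^{1/2}\max_{i\le n}\|\tilde x_i\|_\infty}. \]

The result follows by noting that for $\delta\in\Delta_c$ we have $\vartheta_\delta \ge \kappa_c'/\{(1+c)\sqrt{s}\}$, and for any non-zero $\delta$ with $\|\delta\|_0 \le m$ we have $\vartheta_\delta \ge \sqrt{\phi_{\min}(m)}/\sqrt{m}$.

The third pair follows from noting that

\[ \mathbb{E}_n[w_i|\tilde x_i'\delta|^2] = \mathbb{E}_n[w_i 1\{w_i>\epsilon\}|\tilde x_i'\delta|^2] + \mathbb{E}_n[w_i 1\{w_i\le\epsilon\}|\tilde x_i'\delta|^2] \ge \epsilon\,\mathbb{E}_n[|\tilde x_i'\delta|^2] - \epsilon\,\mathbb{E}_n[1\{w_i\le\epsilon\}|\tilde x_i'\delta|^2]. \]

Moreover, by definition of $\vartheta_\delta$ we have

\[ \mathbb{E}_n[1\{w_i\le\epsilon\}|\tilde x_i'\delta|^2] \le \mathbb{E}_n[1\{w_i\le\epsilon\}]\max_{i\le n}\|\tilde x_i\|_\infty^2\|\delta\|_1^2 = \mathbb{E}_n[1\{w_i\le\epsilon\}]\max_{i\le n}\|\tilde x_i\|_\infty^2\,\frac{\|\tilde x_i'\delta\|_{2,n}^2}{\vartheta_\delta^2}. \]

The result follows. □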

E.2. Identification Lemmas. In this section we collect new identification results for Logistic regression that might be of independent interest. We build upon the following technical lemma of [1], which is based on (modified) self-concordant functions. However, we will apply it differently than in [1]. We exploit the separability of the objective function across observations and make use of the restricted nonlinear impact coefficient [3]. In turn this allows us to weaken the requirements of the analysis when compared to the literature.

Lemma 8 (Lemma 1 from [1]). Let $g : \mathbb{R}\to\mathbb{R}$ be a convex three times differentiable function such that for all $t\in\mathbb{R}$, $|g'''(t)| \le M g''(t)$ for some $M \ge 0$. Then, for all $t \ge 0$ we have

\[ \frac{g''(0)}{M^2}\{\exp(-Mt) + Mt - 1\} \le g(t) - g(0) - g'(0)t \le \frac{g''(0)}{M^2}\{\exp(Mt) - Mt - 1\}. \]

Lemma 9. For $t \ge 0$ we have $\exp(-t) + t - 1 \ge \frac{1}{2}t^2 - \frac{1}{6}t^3$.

Proof of Lemma 9. For $t \ge 0$, consider the function $f(t) = \exp(-t) + t^3/6 - t^2/2 + t - 1$. The statement is equivalent to $f(t) \ge 0$ for $t \ge 0$. It follows that $f(0) = 0$, $f'(0) = 0$, and $f''(t) = \exp(-t) + t - 1 \ge 0$, so that $f$ is convex. Therefore $f(t) \ge f(0) + tf'(0) = 0$. □

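Lemma 10 (Minoration Lemma). We have that, for all $\delta \in A$,

\[ \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta \ge \Big\{\tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2\Big\} \wedge \Big\{\tfrac{q_A}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}\Big\}. \]

Proof. Step 1. (Minoration). Define the maximal radius over which the following criterion function can be minorated by a quadratic function:

\[ r_A = \sup\Big\{ r : \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta \ge \tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2 \ \text{ for all } \delta\in A,\ \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n} \le r \Big\}. \]

Step 2 below shows that $r_A \ge q_A$. By construction of $r_A$ and the convexity of $\Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta$,

\[ \begin{aligned} \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta &\ge \tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2 \wedge \Big\{ \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{r_A}\cdot \inf_{\tilde\delta\in A,\ \|\sqrt{w_i}\tilde x_i'\tilde\delta\|_{2,n}\ge r_A} \Lambda(\eta_0+\tilde\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\tilde\delta \Big\} \\ &\ge \tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2 \wedge \Big\{ \frac{\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}}{r_A}\cdot\tfrac{1}{3}r_A^2 \Big\} \ge \tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2 \wedge \Big\{ \tfrac{q_A}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n} \Big\}, \end{aligned} \]

where the second inequality uses the definition of $r_A$ and the last uses $r_A \ge q_A$.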
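Step 2. ($r_A \ge q_A$). Defining $g_i(t) = \log\{1+\exp(\tilde x_i'\eta_0 + t\tilde x_i'\delta)\}$ we have

\[ \begin{aligned} \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta &= \mathbb{E}_n\big[\log\{1+\exp(\tilde x_i'\{\eta_0+\delta\})\} - y_i\tilde x_i'(\eta_0+\delta)\big] \\ &\quad - \mathbb{E}_n\big[\log\{1+\exp(\tilde x_i'\eta_0)\} - y_i\tilde x_i'\eta_0\big] - \mathbb{E}_n\big[(G_i - y_i)\tilde x_i'\delta\big] \\ &= \mathbb{E}_n\big[\log\{1+\exp(\tilde x_i'\{\eta_0+\delta\})\} - \log\{1+\exp(\tilde x_i'\eta_0)\} - G_i\tilde x_i'\delta\big] \\ &= \mathbb{E}_n\big[g_i(1) - g_i(0) - 1\cdot g_i'(0)\big]. \end{aligned} \]

Note that the function $g_i$ is three times differentiable and satisfies, for $G_i(t) := \exp(\tilde x_i'\eta_0 + t\tilde x_i'\delta)/\{1+\exp(\tilde x_i'\eta_0 + t\tilde x_i'\delta)\}$,

\[ g_i'(t) = (\tilde x_i'\delta)G_i(t), \quad g_i''(t) = (\tilde x_i'\delta)^2 G_i(t)[1-G_i(t)], \quad g_i'''(t) = (\tilde x_i'\delta)^3 G_i(t)[1-G_i(t)][1-2G_i(t)]. \]

Thus $|g_i'''(t)| \le |\tilde x_i'\delta|\,g_i''(t)$. Therefore, by Lemmas 8 and 9 we have

\[ g_i(1) - g_i(0) - 1\cdot g_i'(0) \ge \frac{g_i''(0)}{(\tilde x_i'\delta)^2}\big\{\exp(-|\tilde x_i'\delta|) + |\tilde x_i'\delta| - 1\big\} \ge w_i\Big\{\frac{|\tilde x_i'\delta|^2}{2} - \frac{|\tilde x_i'\delta|^3}{6}\Big\}. \]

Therefore we have

\[ \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta \ge \tfrac{1}{2}\mathbb{E}_n\big[w_i|\tilde x_i'\delta|^2\big] - \tfrac{1}{6}\mathbb{E}_n\big[w_i|\tilde x_i'\delta|^3\big]. \]

Note that for any $\delta\in A$ such that $\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n} \le q_A$ we have

\[ \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n} \le q_A \le \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^3 \big/ \mathbb{E}_n\big[w_i|\tilde x_i'\delta|^3\big], \]

so that $\mathbb{E}_n[w_i|\tilde x_i'\delta|^3] \le \mathbb{E}_n[w_i|\tilde x_i'\delta|^2]$. Therefore we have

\[ \Lambda(\eta_0+\delta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta \ge \tfrac{1}{2}\mathbb{E}_n\big[w_i|\tilde x_i'\delta|^2\big] - \tfrac{1}{6}\mathbb{E}_n\big[w_i|\tilde x_i'\delta|^3\big] \ge \tfrac{1}{3}\mathbb{E}_n\big[w_i|\tilde x_i'\delta|^2\big]. \]

□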
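The pointwise inequality that drives Step 2 can be checked numerically. The sketch below is only a sanity check on a finite grid, not part of the proof; the names `bregman` and `cubic_lower_bound`, and the grid ranges, are illustrative assumptions. For a scalar index $a = \tilde x_i'\eta_0$ and increment $u = \tilde x_i'\delta$, it compares $g_i(1) - g_i(0) - g_i'(0)$ with $w_i\{u^2/2 - |u|^3/6\}$.

```python
import math


def softplus(z):
    """Stable evaluation of log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    """Logistic link G(z), evaluated stably for either sign of z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bregman(a, u):
    """g(1) - g(0) - g'(0) for g(t) = log(1 + exp(a + t*u))."""
    return softplus(a + u) - softplus(a) - sigmoid(a) * u


def cubic_lower_bound(a, u):
    """w * (u^2/2 - |u|^3/6) with w = G(a)(1 - G(a)), as in Lemmas 8 and 9."""
    w = sigmoid(a) * (1.0 - sigmoid(a))
    return w * (u * u / 2.0 - abs(u) ** 3 / 6.0)


# Grid over index values a and increments u (ranges chosen for illustration).
grid = [k / 10.0 for k in range(-50, 51)]
worst_gap = min(bregman(a, u) - cubic_lower_bound(a, u) for a in grid for u in grid)
```

A non-negative `worst_gap` (up to floating-point rounding) is what the lemma predicts at every grid point, including negative increments $u$.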

E.3. Penalty Choice and Rate for $\ell_1$-Penalized Logistic Regression. Next we establish a simple (and known) bound for the choice of the penalty level $\lambda$ within Lasso-Logistic under standard normalization. Refinements are possible under additional mild assumptions on the covariates.

Lemma 11 (Choice of Penalty, Hoeffding's Inequality). Assume that $\mathbb{E}_n[\tilde x_{ij}^2] = 1$. Then, for any $\gamma\in(0,1)$ we have

\[ P\Big( \|\nabla\Lambda(\eta_0)\|_\infty \ge \sqrt{2\log(2(p+1)/\gamma)/n} \Big) \le \gamma. \]

Proof. Let $G_i = E[y_i \mid \tilde x_i] = \exp(\tilde x_i'\eta_0)/\{1+\exp(\tilde x_i'\eta_0)\}$, so that $\|\nabla\Lambda(\eta_0)\|_\infty = \|\mathbb{E}_n[(y_i - G_i)\tilde x_i]\|_\infty$. Then, by the union bound and Hoeffding's inequality,

\[ P\big(\|\mathbb{E}_n[(y_i - G_i)\tilde x_i]\|_\infty \ge t\big) \le (p+1)\max_{j} P\big(|\mathbb{E}_n[(y_i - G_i)\tilde x_{ij}]| \ge t\big) \le 2(p+1)\exp(-t^2 n/2). \]

Setting $t = \sqrt{2\log(2(p+1)/\gamma)/n}$ makes the right-hand side equal to $\gamma$. □
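The bound in Lemma 11 can be illustrated with a small Monte Carlo experiment. This is a sketch under assumptions, not a result from the paper: the design, the coefficient vector `eta0`, the sample sizes, and the helper name `hoeffding_threshold` are all hypothetical. The design is held fixed, each column is rescaled so that $\mathbb{E}_n[\tilde x_{ij}^2] = 1$, and the outcomes are redrawn in each replication.

```python
import math
import random


def hoeffding_threshold(n, p, gamma):
    """t = sqrt(2 log(2(p+1)/gamma) / n), the level appearing in Lemma 11."""
    return math.sqrt(2.0 * math.log(2.0 * (p + 1) / gamma) / n)


def sigmoid(z):
    """Logistic link; the indices used below are moderate, so exp cannot overflow."""
    return 1.0 / (1.0 + math.exp(-z))


random.seed(1)
n, p, gamma = 100, 4, 0.1
eta0 = [1.0, -1.0, 0.5, 0.0, 0.0]          # hypothetical (alpha_0, beta_0')'

# Fixed design with p + 1 columns, normalized so that E_n[x_ij^2] = 1.
X = [[random.gauss(0.0, 1.0) for _ in range(p + 1)] for _ in range(n)]
for j in range(p + 1):
    scale = math.sqrt(sum(row[j] ** 2 for row in X) / n)
    for row in X:
        row[j] /= scale

G = [sigmoid(sum(a * b for a, b in zip(row, eta0))) for row in X]
t = hoeffding_threshold(n, p, gamma)

reps, exceed = 200, 0
for _ in range(reps):
    y = [1.0 if random.random() < g else 0.0 for g in G]
    # Score at the truth: E_n[(y_i - G_i) x_ij], coordinate by coordinate.
    score = [sum((yi - gi) * row[j] for yi, gi, row in zip(y, G, X)) / n
             for j in range(p + 1)]
    if max(abs(s) for s in score) >= t:
        exceed += 1
freq = exceed / reps
```

In experiments of this kind the empirical exceedance frequency is typically far below $\gamma$, reflecting how conservative the union bound combined with Hoeffding's inequality is; this is one motivation for the sharper self-normalized choice discussed next.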

Lemma 12 (Choice of Penalty, Self-Normalized Moderate Deviation Theory). Normalize the

1, let I 



covariates so that ]E„[5?- 



J — yEn[wiX^-], and Ij = JEn[wixfj]. Assume that 



K? logp ^ nSn minj I'j, ^ ^{^ ~ '^p/l) ^ ^niT^ , o,nd \\wi — Wi\\2^nKx ^ 5„ min^ t^-. Then, setting 
V = diag(/), for any 7 G (0, 1) and fi > 0, for n sufficiently large we have 

P (||f-iVA(r?o)||oo ^ {1 + lJ^]^-\l - 7/[2p])/V^) ^ 7 + 0(1). 



Proof. Let T = diag(/), Ij = jEn[{yi - GiYxfA, and T = diag(0. We have 



|r-ivA(r/o)||oo ^ ||{r-^ - r-1 + r-i - r-i}rr-^vA(r?o)|U + ||r-ivA(r/o)||oo^ 

^ {||{f-i - r-i}r||oo + ||{r-i - r-i}r||oo}||r-ivA(%)||oo + \\r-'vA{r,o) 



€ 



< maxj 



=SP 



h u 



h i, 



+ 1 



} ||r-iVA(r/, 



'Ojlloo- 



+ maxj^p 
Since Wi and Wi are non-negative we have 

maxj<:p \lj - lj\ ^ maxj^p JEn[\wi - Wi\xfj] ^ \\wi - t(^j||2,„ maxj^p{E„[xfj-]}^/^. 
Also, since E[(yj — Gj)^ | Xj] = tfj and for positive number \^/a — Vb\ ^ y/\a — b\, we have 



max 



isSp 



'i 'i 



max. 



j^p 



E„[(y.-Gi)2x2.] 



E[t(;jX 



^2.1 



< Jmax,-^p |(E„ - E)[(yi - G.^xU 



By Lemma S] we have 



max |(E„ - E)[(y, - G,)'^]! <p a/ ^ max{E„[4]}V2 



Jj — il v J, J' I ^ ^/16 under the assumed growth 



Therefore for n large enough we have maxj^p '-^ 

conditions with probability 1 — o(l). In the same event we have 

||f-iVA(%)||oo ^ {1 + /u/2}||r-iVA(7?o) 
Finally, by self-normalized moderate deviation theory we have 

P(||r-iVA(7?o)||oo > i) ^ pmaxP I ^rM - G,)x,, ^ ^ ^ j ^ 2p^-\l - ^/[2p]){l + 0(5^)} 



D 
Comment E.1. Note that we can replace $(w_i)_{i=1}^n$ with $(\tilde w_i)_{i=1}^n$ in Lemma 12 if $w_i \le \tilde w_i$ by construction. For instance, $w_i \le \tilde w_i := 1/4$. Therefore it is valid to use $\lambda = \frac{c}{2}\sqrt{n}\,\Phi^{-1}(1-\gamma/[2p])$ and $l_j = 1$ for $c > 1$.

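Lemma 13. Assume $\lambda/n \ge c\|\nabla\Lambda(\eta_0)\|_\infty$, $c > 1$, and let $\bar c = (c+1)/(c-1)$. Provided that $q_{\Delta_{\bar c}} > 3(1+\frac{1}{c})\lambda\sqrt{s}/(n\kappa_{\bar c})$, we have

\[ \|\sqrt{w_i}\,\tilde x_i'(\hat\eta-\eta_0)\|_{2,n} \le 3\Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}} \quad\text{and}\quad \|\hat\eta-\eta_0\|_1 \le 3\,\frac{(1+\bar c)(1+c)}{c}\,\frac{\lambda s}{n\kappa_{\bar c}^2}. \]

Proof. Let $\delta = \hat\eta - \eta_0$. By definition of $\hat\eta$ in (E.41) we have $\Lambda(\hat\eta) + \frac{\lambda}{n}\|\hat\eta\|_1 \le \Lambda(\eta_0) + \frac{\lambda}{n}\|\eta_0\|_1$. Thus,

\[ \Lambda(\hat\eta) - \Lambda(\eta_0) \le \frac{\lambda}{n}\|\eta_0\|_1 - \frac{\lambda}{n}\|\hat\eta\|_1 \le \frac{\lambda}{n}\|\delta_T\|_1 - \frac{\lambda}{n}\|\delta_{T^c}\|_1. \]

However, by convexity of $\Lambda(\cdot)$ and Hölder's inequality we have

\[ \Lambda(\hat\eta) - \Lambda(\eta_0) \ge -\|\nabla\Lambda(\eta_0)\|_\infty\|\delta\|_1 \ge -\frac{\lambda}{cn}\|\delta\|_1. \]

Combining these relations we have $\frac{\lambda}{n}\|\delta_T\|_1 - \frac{\lambda}{n}\|\delta_{T^c}\|_1 \ge -\frac{\lambda}{cn}\|\delta_T\|_1 - \frac{\lambda}{cn}\|\delta_{T^c}\|_1$, which leads to

\[ \|\delta_{T^c}\|_1 \le \bar c\,\|\delta_T\|_1. \]

By Lemma 10 with $A = \Delta_{\bar c}$ and the reasoning above we have

\[ \begin{aligned} \tfrac{1}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}^2 \wedge \Big\{\tfrac{q_A}{3}\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}\Big\} &\le \Lambda(\hat\eta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\delta \\ &\le \frac{\lambda}{n}\|\delta_T\|_1 - \frac{\lambda}{n}\|\delta_{T^c}\|_1 + \|\nabla\Lambda(\eta_0)\|_\infty\|\delta\|_1 \\ &\le \Big(1+\frac{1}{c}\Big)\frac{\lambda}{n}\|\delta_T\|_1 \le \Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n}\|\delta_T\| \\ &\le \Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n}\,\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}/\kappa_{\bar c}. \end{aligned} \]

Provided that $q_A > 3(1+\frac{1}{c})\lambda\sqrt{s}/(\kappa_{\bar c}n)$, so that the minimum on the LHS needs to be the quadratic term, we have

\[ \|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n} \le 3\Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}}. \]

The $\ell_1$ bound follows since $\|\delta\|_1 \le (1+\bar c)\|\delta_T\|_1 \le (1+\bar c)\sqrt{s}\,\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}/\kappa_{\bar c}$. □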

E.4. Sparsity of Lasso-Logistic. We begin by establishing sparsity bounds which rely neither on large penalty choices nor on the irrepresentable condition (namely the assumption $\|\mathbb{E}_n[\tilde x_{iT^c}\tilde x_{iT}'](\mathbb{E}_n[\tilde x_{iT}\tilde x_{iT}'])^{-1}\mathrm{sign}(\eta_{0T})\|_\infty < 1$). The (data-driven) sparsity is fundamental for the analysis of the rate of convergence of the Post-Lasso-Logistic estimator. The following lemma is useful.
Lemma 14. The logistic link function satisfies $|G(t+t_0) - G(t_0)| \le G'(t_0)\{\exp(|t|) - 1\}$. If $|t| \le 1$ we have $\exp(|t|) - 1 \le 2|t|$.

Proof. Note that $|G''(s)| \le G'(s)$ for all $s$, so that $-1 \le \frac{d}{ds}\log(G'(s)) = \frac{G''(s)}{G'(s)} \le 1$. Suppose $s \ge 0$. Therefore

\[ -s \le \log(G'(s+t_0)) - \log(G'(t_0)) \le s. \]

In turn this implies $G'(t_0)\exp(-s) \le G'(s+t_0) \le G'(t_0)\exp(s)$. Integrating one more time from $0$ to $t$,

\[ G'(t_0)\{1-\exp(-t)\} \le G(t+t_0) - G(t_0) \le G'(t_0)\{\exp(t) - 1\}. \]

The first result follows by noting that $1-\exp(-t) \le \exp(t) - 1$. The second follows by verification.

□
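Both claims of Lemma 14 are easy to probe numerically. The following is an illustrative grid check only, with hypothetical names (`link_gap`) and grid ranges chosen for the example; it complements but does not replace the proof above.

```python
import math


def G(z):
    """Logistic link, evaluated stably for either sign of z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def G_prime(z):
    """Derivative of the logistic link: G'(z) = G(z)(1 - G(z))."""
    g = G(z)
    return g * (1.0 - g)


def link_gap(t, t0):
    """G'(t0){exp(|t|) - 1} - |G(t + t0) - G(t0)|; Lemma 14 says this is >= 0."""
    return G_prime(t0) * math.expm1(abs(t)) - abs(G(t + t0) - G(t0))


t0_grid = [k / 10.0 for k in range(-80, 81)]     # base points t0 in [-8, 8]
t_grid = [k / 20.0 for k in range(-60, 61)]      # increments t in [-3, 3]
worst_link_gap = min(link_gap(t, t0) for t0 in t0_grid for t in t_grid)

# Second claim: exp(|t|) - 1 <= 2|t| whenever |t| <= 1.
worst_linear_gap = min(2.0 * abs(t) - math.expm1(abs(t)) for t in t_grid if abs(t) <= 1.0)
```

The second claim is what allows the bound $|\hat G_i - G_i| \le 2w_i|\tilde x_i'\delta|$ whenever $|\tilde x_i'\delta| \le 1$, since $G'(\tilde x_i'\eta_0) = w_i$.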



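Lemma 15 (Sparsity). Consider $\hat\eta$ as defined in (E.41). Suppose $\lambda/n \ge c\|\nabla\Lambda(\eta_0)\|_\infty$. Then for $\hat s = |\mathrm{support}(\hat\eta)|$

\[ \hat s \le \frac{c^2}{(c-1)^2}\,\frac{n^2}{\lambda^2}\,\phi_{\max}(\hat s)\,\|\tilde x_i'(\hat\eta-\eta_0)\|_{2,n}^2. \]

Moreover, if $3\frac{(1+\bar c)(1+c)}{c}\frac{\lambda s}{n\kappa_{\bar c}^2}\max_{i\le n}\|\tilde x_i\|_\infty \le 1$ we have

\[ \sqrt{\hat s} \le 6\bar c\,\frac{\sqrt{\phi_{\max}(\hat s)}}{\kappa_{\bar c}}\sqrt{s} \quad\text{and}\quad \hat s \le 36\,\bar c^{2}\, s\min_{m\in\mathcal{M}}\frac{\phi_{\max}(m)}{\kappa_{\bar c}^2}, \]

where $\mathcal{M} = \{m\in\mathbb{N} : m > 72\,s\,\bar c^{2}\phi_{\max}(m)/\kappa_{\bar c}^2\}$.

Proof. Let $\hat T = \mathrm{support}(\hat\eta)$, $\hat s = |\hat T|$, $\delta = \hat\eta - \eta_0$, and $\hat G_i = \exp(\tilde x_i'\hat\eta)/\{1+\exp(\tilde x_i'\hat\eta)\}$. For any $j\in\hat T$ we have $|\nabla_j\Lambda(\hat\eta)| = |\mathbb{E}_n[(y_i - \hat G_i)\tilde x_{ij}]| = \lambda/n$.

The first relation follows from

\[ \begin{aligned} \frac{\lambda}{n}\sqrt{\hat s} &= \|\mathbb{E}_n[(y_i - \hat G_i)\tilde x_{i\hat T}]\|_2 \\ &\le \|\mathbb{E}_n[(y_i - G_i)\tilde x_{i\hat T}]\|_2 + \|\mathbb{E}_n[(\hat G_i - G_i)\tilde x_{i\hat T}]\|_2 \\ &\le \sqrt{\hat s}\,\|\mathbb{E}_n[(y_i - G_i)\tilde x_{i\hat T}]\|_\infty + \|\mathbb{E}_n[\tilde x_i'\delta\,\tilde x_{i\hat T}]\|_2 \\ &\le \frac{\lambda}{cn}\sqrt{\hat s} + \sqrt{\phi_{\max}(\hat s)}\,\|\tilde x_i'\delta\|_{2,n}. \end{aligned} \]

The second relation follows from

\[ \begin{aligned} \frac{\lambda}{n}\sqrt{\hat s} &= \|\mathbb{E}_n[(y_i - \hat G_i)\tilde x_{i\hat T}]\|_2 \\ &\le \|\mathbb{E}_n[(y_i - G_i)\tilde x_{i\hat T}]\|_2 + \|\mathbb{E}_n[(\hat G_i - G_i)\tilde x_{i\hat T}]\|_2 \\ &\le \sqrt{\hat s}\,\|\mathbb{E}_n[(y_i - G_i)\tilde x_{i\hat T}]\|_\infty + \sup_{\|\theta\|_0\le\hat s,\ \|\theta\|\le 1}\mathbb{E}_n[|\hat G_i - G_i|\cdot|\tilde x_i'\theta|] \\ &\le \frac{\lambda}{cn}\sqrt{\hat s} + 2\sqrt{\phi_{\max}(\hat s)}\,\|\sqrt{w_i}\,\tilde x_i'\delta\|_{2,n}, \end{aligned} \]

where we used Lemma 14 so that $|\hat G_i - G_i| \le w_i 2|\tilde x_i'\delta|$, since by Lemma 13 $\|\delta\|_1 \le 3\frac{(1+\bar c)(1+c)}{c}\frac{\lambda s}{n\kappa_{\bar c}^2}$, so that $\max_{i\le n}\|\tilde x_i\|_\infty\|\delta\|_1 \le 1$ by the assumed condition.

Therefore, by the $\|\cdot\|_{2,n}$ bound in Lemma 13 we have

\[ \Big(1-\frac{1}{c}\Big)\frac{\lambda}{n}\sqrt{\hat s} \le 2\sqrt{\phi_{\max}(\hat s)}\cdot 3\Big(1+\frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_{\bar c}}, \]

which implies $\sqrt{\hat s} \le 6\bar c\,\frac{\sqrt{\phi_{\max}(\hat s)}}{\kappa_{\bar c}}\sqrt{s}$.

The last relation follows by the previous result and the fact that sparse eigenvalues are sublinear functions. □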

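E.5. Post model selection Logistic regression rate.

Lemma 16. Consider $\tilde\eta$ as defined in (E.42). Let $s^* := |\hat T^*|$. We have

\[ \|\sqrt{w_i}\,\tilde x_i'(\tilde\eta-\eta_0)\|_{2,n} \le \sqrt{3}\,\sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)} + 3\sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty/\sqrt{\phi_{\min}(s^*+s)} \]

provided that $q_A/6 > \sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty/\sqrt{\phi_{\min}(s^*+s)}$ and $q_A/6 > \sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)}$ for $A = \{\delta\in\mathbb{R}^p : \|\delta\|_0 \le s^*+s\}$.

Proof. Let $\tilde\delta = \tilde\eta - \eta_0$ and $\tilde t_{2,n} = \|\sqrt{w_i}\,\tilde x_i'\tilde\delta\|_{2,n}$. By Lemma 10 with $A = \{\delta\in\mathbb{R}^p : \|\delta\|_0 \le s^*+s\}$, we have

\[ \begin{aligned} \tfrac{1}{3}\tilde t_{2,n}^2 \wedge \Big\{\tfrac{q_A}{3}\tilde t_{2,n}\Big\} &\le \Lambda(\tilde\eta) - \Lambda(\eta_0) - \nabla\Lambda(\eta_0)'\tilde\delta \\ &\le \Lambda(\tilde\eta) - \Lambda(\eta_0) + \|\nabla\Lambda(\eta_0)\|_\infty\|\tilde\delta\|_1 \\ &\le \Lambda(\tilde\eta) - \Lambda(\eta_0) + \tilde t_{2,n}\sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty/\sqrt{\phi_{\min}(s^*+s)}. \end{aligned} \]

Provided that $q_A/6 > \sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty/\sqrt{\phi_{\min}(s^*+s)}$ and $q_A/6 > \sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)}$, if the minimum on the LHS is the linear term, we have $\tilde t_{2,n} \le \sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)}$, which implies the result. Otherwise, since for positive numbers $a^2 \le b + ac$ implies $a \le \sqrt{b} + c$, we have

\[ \tilde t_{2,n} \le \sqrt{3}\,\sqrt{\Lambda(\tilde\eta)-\Lambda(\eta_0)} + 3\sqrt{s^*+s}\,\|\nabla\Lambda(\eta_0)\|_\infty/\sqrt{\phi_{\min}(s^*+s)}. \]

□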